- Cement: kg in a m³ mixture
- Blast furnace slag: kg in a m³ mixture
- Fly ash: kg in a m³ mixture
- Water: kg in a m³ mixture
- Superplasticizer: kg in a m³ mixture
- Coarse aggregate: kg in a m³ mixture
- Fine aggregate: kg in a m³ mixture
- Age: days (1–365)
- Concrete compressive strength: MPa (target)
The aim of the dataset is to predict the concrete compressive strength of high-performance concrete (HPC). HPC does not always mean high strength; it covers all kinds of concrete for special applications that are not possible with standard concretes. Therefore, our target value is the 'strength' column, the concrete compressive strength in MPa.
# importing the necessary package for performing advanced mathematical operation
import math
# importing the necessary package for managing data
import pandas as pd
import numpy as np
# importing the necessary packages for visualisation
import seaborn as sns
import matplotlib.pyplot as plt
sns.set(color_codes = True)
%matplotlib inline
# command to display graphs on a darkgrid background
sns.set_style(style = 'darkgrid')
# for doing statistical calculation
import scipy
from sklearn import linear_model
import statsmodels.api as sm
from sklearn import metrics
from sklearn import datasets
import scipy.stats as stats
from scipy.stats import skew
# pre-processing method
from sklearn.model_selection import train_test_split
from scipy.stats import zscore
# Import Linear Regression machine learning library
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso
from sklearn.tree import DecisionTreeRegressor
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor
import xgboost
from xgboost import XGBRegressor
# For creating Polynomial Features
from sklearn.preprocessing import PolynomialFeatures
# methods and classes for evaluation
from scipy.stats import randint as sp_randint
from sklearn.metrics import r2_score
from sklearn.model_selection import cross_validate
from sklearn.utils import resample
import warnings
warnings.filterwarnings("ignore")
df = pd.read_csv('concrete (1).csv')
df.head()
def indetailtable(df):
    print(f'Dataset Shape: {df.shape}')
    print('Total number of rows in dataset = {}'.format(df.shape[0]))
    print('Total number of columns in dataset = {}'.format(df.shape[1]))
    print('Datatypes present in the dataset: {}\n'.format(df.dtypes.value_counts()))
    summary = pd.DataFrame(df.dtypes, columns = ['dtypes'])
    summary = summary.reset_index().rename(columns = {'index': 'Name'})
    summary['Missing_values'] = df.isnull().sum().values
    summary['Unique_values'] = df.nunique().values
    summary['Duplicate_values'] = df.duplicated().sum()
    summary['1st value'] = df.loc[0].values
    summary['2nd value'] = df.loc[1].values
    summary['1029th value'] = df.loc[1028].values
    summary['1030th value'] = df.loc[1029].values
    return summary
brief = indetailtable(df)
brief
There are 1030 data points / observations and 9 columns in the dataset: 8 independent features plus the target.
All the columns are numeric, either float or integer. The column 'strength' is the target, which gives the concrete compressive strength of high-performance concrete (HPC).
HPC does not necessarily mean high strength; it covers all kinds of concrete for special applications that are not possible with standard concretes.
The rest of the columns are independent features, all numeric in nature.
There are no missing values, but we certainly have some duplicate rows.
A large number of unique values can be observed in each of the columns.
Further on, we will explore the duplicate rows and apply the proper treatment to avoid biasing our models.
Since the target column is continuous, this is a regression problem rather than a classification one.
df.columns
for value in ['cement', 'slag', 'ash', 'water', 'superplastic', 'coarseagg',
              'fineagg', 'age', 'strength']:
    print(value, ':', sum(df[value] == '?'))
df.isnull().values.any()
duplicate_rows_df = df[df.duplicated()]
print('Number of duplicated rows:', duplicate_rows_df.shape[0])
Hence, there are 25 duplicate rows, identical across every column, which are repetitive in nature and can be treated as irrelevant. Duplication could stem from data entry errors or from the data collection method. Whatever the reason, it can lead to incorrect conclusions by making some observations appear more common than they really are.
duplicate_rows_df_cement = df[df.duplicated(['cement'])]
print(duplicate_rows_df_cement.shape)
print(len(df.cement.unique()))
The cement column thus has 278 unique values; adding the unique and duplicated counts (278 + 752 = 1030) recovers the total number of rows. For better analysis we should retain the unique rows, so at the least we can drop the rows whose values are identical in every column.
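The arithmetic above follows from how pandas defines `duplicated`: every repeat after the first occurrence is flagged, so the unique and duplicated counts always sum to the row total. A toy check (a synthetic series, not the concrete data):

```python
import pandas as pd

# six values, three distinct: 10, 20, 30
toy = pd.Series([10, 10, 20, 30, 30, 30])

n_unique = toy.nunique()          # distinct values
n_dup = toy.duplicated().sum()    # every repeat after the first occurrence

print(n_unique, n_dup, n_unique + n_dup)  # → 3 3 6
```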
df = df.drop_duplicates()
print('Shape of dataframe after dropping duplicates:', df.shape)
Thus we have deleted all 25 duplicated rows. The dataframe is now ready for further EDA.
df.describe()
The table above gives the five-point summary of each column (minimum, 25th, 50th and 75th percentiles, and maximum), along with the count, mean and standard deviation.
This statistical table shows the distribution of each attribute.
Some columns contain zero values, meaning those constituents were absent from the mix. This may have been done intentionally, to test the strength of concrete made without certain constituents.
The age column has a minimum value of 1, the bare minimum curing period allowed for the concrete to form bonds between its constituents. Normally 28 days is the recommended period to ensure reliable results; a 1-day period, or anything below 28 days, gives the concrete less strength.
Similarly, the strength column shows a minimum of 2.33 MPa, a weak concrete with low compressive strength. This might be due to an improper mixture ratio of constituents, an extreme excess or shortage of water, or a very short curing period.
Although all columns except age and strength share a single unit (kg/m^3), they span very different ranges of values, which may bias model performance.
To avoid this kind of problem, proper scaling is required, for example with MinMaxScaler or StandardScaler, or simply by applying the z-score.
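As a sketch of the scaling options just mentioned (on a small made-up frame, since the real dataframe is loaded from the CSV), `scipy.stats.zscore` and sklearn's `StandardScaler` agree because both default to the population standard deviation:

```python
import numpy as np
import pandas as pd
from scipy.stats import zscore
from sklearn.preprocessing import StandardScaler

# toy frame standing in for the concrete data
toy = pd.DataFrame({'cement': [540.0, 332.5, 198.6],
                    'water':  [162.0, 228.0, 192.0]})

# scipy's zscore and StandardScaler both use ddof=0 (population std),
# so the two scaled frames agree element-wise
scaled_a = toy.apply(zscore)
scaled_b = pd.DataFrame(StandardScaler().fit_transform(toy), columns=toy.columns)

print(np.allclose(scaled_a.values, scaled_b.values))  # → True
```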
# Checking IQR
IQR = df.quantile(0.75) - df.quantile(0.25)
IQR
# Checking range
print( 'Range in Cement:', df.cement.max() - df.cement.min())
print('Range in Slag:',df.slag.max() - df.slag.min())
print('Range in Ash:',df.ash.max() - df.ash.min())
print('Range in water:',df.water.max() - df.water.min())
print('Range in Superplastic:',df.superplastic.max() - df.superplastic.min())
print('Range in Coarseagg:',df.coarseagg.max() - df.coarseagg.min())
print('Range in fineagg:',df.fineagg.max() - df.fineagg.min())
print('Range in Age:',df.age.max() - df.age.min())
print('Range in Strength:',df.strength.max() - df.strength.min())
# Checking for the 1.5*IQR for presence of outliers
1.5*IQR
(df.max()-df.min()) - 1.5*IQR
Strictly, a point is an outlier when it lies below Q1 - 1.5*IQR or above Q3 + 1.5*IQR; subtracting 1.5*IQR from the range gives a rough measure of how far the data spread beyond the whiskers.
The differences above clearly indicate the presence of outliers beyond the upper whisker.
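The rule being applied here is Tukey's fence: a value is flagged when it falls below Q1 − 1.5·IQR or above Q3 + 1.5·IQR. A minimal helper, demonstrated on a toy series rather than the concrete columns:

```python
import pandas as pd

def tukey_outlier_count(s: pd.Series) -> int:
    """Count values outside the 1.5*IQR whiskers."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return int(((s < lower) | (s > upper)).sum())

# toy series: 100 is far beyond the upper whisker
s = pd.Series([1, 2, 3, 4, 5, 100])
print(tukey_outlier_count(s))  # → 1
```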
# Checking the total number of unique values for target column
df['strength'].value_counts()
plt.figure(figsize = (20,4))
plt.subplot(1,3,1)
plt.hist(df.strength , color = 'blue', edgecolor = 'black', alpha = 0.7);
plt.xlabel('Strength of Concrete in MPa');
plt.subplot(1,3,2)
sns.violinplot(x = df['strength'], color = 'blue')
plt.xlabel('Strength of Concrete in MPa');
plt.subplot(1,3,3)
sns.boxplot(x = df.strength,palette = 'YlOrRd')
plt.xlabel('Strength of Concrete in MPa');
plt.figure(figsize = (20,12))
plt.subplot(2,2,1)
plt.hist(df.cement , color = 'green', edgecolor = 'black', alpha = 0.7);
plt.xlabel('Quantity of Cement in Kg per cubic meter of Concrete')
plt.subplot(2,2,2)
sns.violinplot(x = df['cement'], palette = 'gist_rainbow')
plt.xlabel('Quantity of Cement in Kg per cubic meter of Concrete')
plt.subplot(2,2,3)
sns.boxplot(x = df.cement, palette = 'copper')
plt.xlabel('Quantity of Cement in Kg per cubic meter of Concrete');
plt.subplot(2,2,4)
sns.scatterplot(x = df.cement, y = df.strength, color = 'green')
plt.title('Variation of strength with respect to cement content', color = 'brown');
Cement is one of the independent attributes and plays a major role in building the compressive strength of concrete. Adding more cement increases strength.
However, if an excessive amount of cement is added, a high heat of hydration is generated, which induces thermal stresses in the concrete and creates cracks.
The distribution of this column is not normal; it appears right-skewed, and no outliers are present.
plt.figure(figsize = (20,12))
plt.subplot(2,2,1)
plt.hist(df.slag , color = 'violet', edgecolor = 'black', alpha = 0.7);
plt.xlabel('Quantity of slag in Kg per cubic meter of Concrete')
plt.subplot(2,2,2)
sns.violinplot(x = df['slag'], color = 'violet')
plt.xlabel('Quantity of slag in Kg per cubic meter of Concrete')
plt.subplot(2,2,3)
sns.boxplot(x = df.slag, color = 'teal')
plt.xlabel('Quantity of slag in Kg per cubic meter of Concrete');
plt.subplot(2,2,4)
sns.scatterplot(x = df.slag, y = df.strength, color = 'violet')
plt.title('Variation of strength with respect to Slag content', color = 'brown');
Slag is a by-product of the blast furnace, and it plays a major role in deciding the strength of concrete.
It acts like a binder and helps increase the durability and strength of the concrete; however, the hardening process takes longer to reach full compressive strength.
Initially the distribution looks normal, but towards higher values it becomes right-skewed, with a long tail representing outliers.
The multimodal shape signals two clusters, a mixture of Gaussians, which might have arisen during data collection.
The box-whisker plot shows outliers beyond the upper whisker.
plt.figure(figsize = (20,12))
plt.subplot(2,2,1)
plt.hist(df.ash , color = 'orange', edgecolor = 'black', alpha = 0.7);
plt.xlabel('Quantity of ash in Kg per cubic meter of Concrete')
plt.subplot(2,2,2)
sns.violinplot(x = df['ash'], color = 'orange')
plt.xlabel('Quantity of ash in Kg per cubic meter of Concrete')
plt.subplot(2,2,3)
sns.boxplot(x = df.ash, palette = 'gnuplot2')
plt.xlabel('Quantity of ash in Kg per cubic meter of Concrete');
plt.subplot(2,2,4)
sns.scatterplot(x = df.ash, y = df.strength, color = 'teal');
plt.title('Variation of strength with respect to Fly ash content', color = 'brown');
Fly ash consists of small dark flakes produced by burning powdered coal.
Like slag, it acts as a good binder, improving the durability and strength of the concrete; again the hardening process takes longer to reach full compressive strength.
Adding fly ash reduces bleeding in the concrete and improves its workability, and it can improve the long-term compressive strength of conventional concrete.
The distribution of the column appears to be a mixture of Gaussians; a gap or discontinuity between two of them indicates two clusters in the column.
Although the distribution is right-tailed, no outliers are observed.
plt.figure(figsize = (20,12))
plt.subplot(2,2,1)
plt.hist(df.water , color = 'pink', edgecolor = 'black', alpha = 0.7);
plt.xlabel('Quantity of water in Kg per cubic meter of Concrete')
plt.subplot(2,2,2)
sns.violinplot(x = df['water'], color = 'pink')
plt.xlabel('Quantity of water in Kg per cubic meter of Concrete')
plt.subplot(2,2,3)
sns.boxplot(x = df.water, palette = 'inferno')
plt.xlabel('Quantity of water in Kg per cubic meter of Concrete');
plt.subplot(2,2,4)
sns.scatterplot(x = df.water, y = df.strength, color = 'purple');
plt.title('Variation of strength with respect to water content', color = 'brown');
Water acts as the binder that brings all the constituents together to form concrete.
An incorrect proportion of water may lead to low strength or cracks.
Too much water, or over-watered concrete, may lead to lower strength, reduced durability, shrinkage cracking and a variety of surface problems. Thus, the proper amount of water must be added along with all the other constituents to ensure the desired strength.
The distribution of this column appears multimodal, with more than three clusters.
The long tails extending to both sides of the distribution indicate outliers in the attribute.
The box-whisker plot confirms the presence of outliers on both sides.
plt.figure(figsize = (20,12))
plt.subplot(2,2,1)
plt.hist(df.superplastic , color = 'purple', edgecolor = 'black', alpha = 0.7);
plt.xlabel('Quantity of superplastic in Kg per cubic meter of Concrete')
plt.subplot(2,2,2)
sns.violinplot(x = df['superplastic'], color = 'purple')
plt.xlabel('Quantity of superplastic in Kg per cubic meter of Concrete')
plt.subplot(2,2,3)
sns.boxplot(x = df.superplastic, color = 'white')
plt.xlabel('Quantity of superplastic in Kg per cubic meter of Concrete');
plt.subplot(2,2,4)
sns.scatterplot(x = df.superplastic, y = df.strength, color = 'violet');
plt.title('Variation of strength with respect to Superplasticizer content', color = 'brown');
Superplasticizer is used to ensure better flow properties because it minimizes particle segregation. It also allows the water-cement ratio to be decreased, which leads to higher compressive strength.
In hardened concrete, superplasticizer increases compressive strength by enhancing the effectiveness of compaction, producing denser concrete.
The distribution of this column clearly indicates a mixture of Gaussians; the curve appears bimodal and uneven.
The right-skewed data, with a long right tail in the violin plot, indicates outliers on the right side.
The box-whisker plot shows outliers on the right side, at a considerable distance from the upper quartile.
plt.figure(figsize = (20,12))
plt.subplot(2,2,1)
plt.hist(df.coarseagg , color = 'red', edgecolor = 'black', alpha = 0.7);
plt.xlabel('Quantity of coarseagg in Kg per cubic meter of Concrete')
plt.subplot(2,2,2)
sns.violinplot(x = df['coarseagg'], color = 'red')
plt.xlabel('Quantity of coarseagg in Kg per cubic meter of Concrete')
plt.subplot(2,2,3)
sns.boxplot(x = df.coarseagg, color = 'orange')
plt.xlabel('Quantity of coarseagg in Kg per cubic meter of Concrete');
plt.subplot(2,2,4)
sns.scatterplot(x = df.coarseagg, y = df.strength, color = 'orange');
plt.title('Variation of strength with respect to coarse aggregate content', color = 'brown');
Coarseagg stands for the coarse aggregate column. It generally consists of gravel and crushed stone.
A larger percentage of coarse aggregate in the mix makes it contribute substantially to the strength of the concrete. However, the tensile strength of the concrete is adversely affected when the size of the coarse aggregate is increased.
In this column, the distribution appears multimodal, again perhaps a mixture of Gaussians, possibly due to an uneven distribution of data points.
This column has no outliers, so it is free from the bias that outliers introduce.
plt.figure(figsize = (20,12))
plt.subplot(2,2,1)
plt.hist(df.fineagg , color = 'white', edgecolor = 'red', alpha = 0.7);
plt.xlabel('Quantity of fineagg in Kg per cubic meter of Concrete')
plt.subplot(2,2,2)
sns.violinplot(x = df['fineagg'], color = 'white')
plt.xlabel('Quantity of fineagg in Kg per cubic meter of Concrete')
plt.subplot(2,2,3)
sns.boxplot(x = df.fineagg, palette = 'gist_rainbow')
plt.xlabel('Quantity of fineagg in Kg per cubic meter of Concrete');
plt.subplot(2,2,4)
sns.scatterplot(x = df.fineagg, y = df.strength, color = 'red');
plt.title('Variation of strength with respect to fine aggregate content', color = 'brown');
Fineagg stands for fine aggregate: small-sized aggregate and finely crushed stone, consisting mostly of sand.
It increases the flexural strength of the concrete; however, workability decreases as the fine content increases.
It helps the cement bind properly and thus increases the strength of the concrete, though beyond a certain level it decreases strength instead.
The distribution of the column is nearly normal; however, the right tail indicates the presence of outliers.
The same can be inferred from the box-whisker plot.
plt.figure(figsize = (20,12))
plt.subplot(2,2,1)
plt.hist(df.age , color = 'brown', edgecolor = 'green', alpha = 0.7);
plt.xlabel('Incubation period in number of days')
plt.subplot(2,2,2)
sns.violinplot(x = df['age'], color = 'brown', alpha = 0.7)
plt.xlabel('Incubation period in number of days')
plt.subplot(2,2,3)
sns.boxplot(x = df.age,palette = 'gist_stern')
plt.xlabel('Incubation period in number of days');
plt.subplot(2,2,4)
sns.scatterplot(x = df.age, y = df.strength, color = 'brown');
plt.title('Variation of strength with respect to incubation period', color = 'brown');
Age is a deciding factor for increase in strength. The compressive strength of the concrete increases with age. However, this increase in strength gets stagnated after one year.
As per the industry standard, 28 days period is the basic period to attain a level of compressive strength.
In the histogram most of the data points lie between 0 and 100 days, with the 0-25 day range having the highest count.
A few concretes were allowed to solidify and gain strength for up to 350 days. The multimodal, mixture-of-Gaussians shape indicates 3 to 4 clusters, and the violin plot clearly shows the rightward skew of the data points. The very long right tail indicates outliers in the column.
The same can be seen in the box-whisker plot, where the outliers lie far beyond the upper whisker.
fig, ax = plt.subplots(figsize=(10,7))
sns.scatterplot(y="strength", x="cement", hue="water", size="age", data=df, ax=ax, sizes=(50, 300))
ax.set_title("strength vs (cement, age, water)")
ax.legend(loc="upper left", bbox_to_anchor=(1,1))
plt.show()
It can be seen that as the quantity of cement increases, strength also increases, provided the cement-to-water ratio is right and the concrete is given sufficient age.
Some points show good strength even at a low proportion of cement, thanks to the proper quantity of water and a sufficient period for the concrete to strengthen.
fig, ax = plt.subplots(figsize=(10,7))
sns.scatterplot(y="strength", x="fineagg", hue="ash", size="superplastic", data=df, ax=ax, sizes=(50, 300))
ax.set_title("strength vs (fineagg, superplastic, ash)")
ax.legend(loc="upper left", bbox_to_anchor=(1,1))
plt.show()
A linear relationship cannot be expected between fine aggregate and strength. At extreme proportions of fine aggregate, the ash quantity is very low, the superplasticizer minimal, and the strength only moderate.
An optimal quantity of fine aggregate with comparable amounts of ash and superplasticizer gives a good-strength concrete.
fig, ax = plt.subplots(figsize=(10,7))
sns.scatterplot(y="strength", x="superplastic", hue="water", size="fineagg", data=df, ax=ax, sizes=(50, 300))
ax.set_title("strength vs (superplastic,fineagg , water)")
ax.legend(loc="upper left", bbox_to_anchor=(1,1))
plt.show()
A superplasticizer quantity of around 5 to 25, with 150-180 kg of water and around 650 kg of fine aggregate per cubic metre, gives concrete of good strength.
The strength of concrete depends strongly on the quantity of water present, together with a fixed amount of fine aggregate. Dropping superplasticizer entirely from the mixture shows no clear effect on strength.
# Preparing a pandas dataframe to store the skewness of each column
cols = ['cement', 'slag', 'ash', 'water', 'superplastic',
        'coarseagg', 'fineagg', 'age', 'strength']
Skewness = pd.DataFrame({'Skewness': [stats.skew(df[c]) for c in cols]}, index = cols)
Skewness
The 'age' feature is highly right-skewed due to outliers: age is the time allowed for the concrete to gain strength, and some concretes were kept curing for many days, creating data points far beyond the IQR.
Features like 'superplastic', 'slag', 'ash' and 'cement' are the ingredients that determine the strength of the concrete; adding them in the correct proportion enhances strength. The skewness in these columns may be due to incorrect mixture proportions, in some cases an unusually high or low quantity of one ingredient relative to the others, pushing values beyond the IQR.
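One common remedy for this kind of right skew is a log transform. The sketch below uses a synthetic exponential sample (standing in for the long-tailed 'age' column, since the real dataframe comes from the CSV) to show how `np.log1p` reduces the skewness in magnitude:

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(0)
# exponential draws mimic the long right tail of the age column
raw = rng.exponential(scale=45.0, size=1000)

# the raw sample is strongly right-skewed; after log1p the skewness
# is much smaller in magnitude (though not exactly zero)
print(skew(raw))
print(skew(np.log1p(raw)))
```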
# Correlation of entire dataframe
corr_matrix = df.corr()
# Features more related to Strength of Concrete
corr_matrix['strength'].sort_values(ascending = False)
Of all the features, cement has the highest correlation with the target column, followed by superplastic, age and slag.
Although none of the correlations are close to 1 or -1, some of the independent features do influence the strength column.
sns.pairplot(df, diag_kind = 'kde', markers = "^",
             plot_kws = dict(s = 50, edgecolor = "red", linewidth = 0.5),
             diag_kws = dict(shade = True));
The pair-plot panel gives some important visual observations about the independent columns and the target column.
From the KDEs on the diagonal, it appears there may be 2 to 7 clusters in the dataset.
Some columns form a data cloud, but with a rather high spread.
Most pairs of columns, age excepted, form a cloud against each other; vertical linear lines can also be seen, indicating an accumulation of data points at a single value.
The age column, which ranges from 1 to 365, shows about 5 clusters as 5 vertical lines, one per distinct day count.
None of the columns are highly correlated with each other, so none of them strongly influence one another: multicollinearity appears to be absent.
This is a good sign for the relationship among independent attributes and should only faintly affect model performance.
The long tails suggest leverage from outliers, so proper feature engineering is required to improve model performance.
The relationship of each independent attribute to the target column is nonlinear.
Except for the cement column, all have a poor correlation with the target (strength) column.
Although the age column shows a linear relationship with the target within clusters, the slope appears close to zero.
Since most data points sit in scatter clouds between the target and the independent columns, a nonlinear relationship can be inferred, signifying poor correlation between target and independent attributes.
Columns with poor correlation to the target could therefore be merged into composite attributes.
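One hypothetical composite attribute, assuming the column names used above, is the water-to-cement ratio, a quantity mix designers already use; total aggregate content is another. Both are illustrative suggestions, sketched on a toy frame rather than the actual dataset:

```python
import pandas as pd

# toy rows standing in for the concrete dataframe
toy = pd.DataFrame({'cement': [540.0, 332.5], 'water': [162.0, 228.0],
                    'coarseagg': [1040.0, 932.0], 'fineagg': [676.0, 594.0]})

# water/cement ratio and total aggregate are illustrative composites,
# not columns from the original dataset
toy['water_cement_ratio'] = toy['water'] / toy['cement']
toy['total_agg'] = toy['coarseagg'] + toy['fineagg']
print(toy[['water_cement_ratio', 'total_agg']])
```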
df.corr()
colormap = plt.cm.viridis # Color range to be used in heatmap
plt.figure(figsize = (15,12))
plt.title('Pearson Correlation of attributes', y=1, size = 20)
sns.heatmap(df.corr(), linewidth = 0.2, vmax = 1.0,
square = True, cmap = colormap,linecolor = 'red', annot = True);
None of the columns are highly correlated with each other.
The strongest correlation among features is between superplastic and water, at -0.66, a negative correlation.
The target column correlates well with cement; however, no other column has a strong linear relationship with the target.
The correlation values of some independent attributes with the target are nearly equal, so merging those columns can also be considered.
So we do not need to drop any columns based on correlation; rather, we can create composite attributes by merging columns that have very low or nearly equal correlation with the target column.
In this part we cleaned the data using various data-cleaning mechanisms.
Exploratory data analysis was carried out with statistical description, univariate, bivariate and multivariate plots, and the correlation matrix, while the IQR range, the skewness of each attribute, the correlations of independent attributes, and the mean, mode and median with standard deviation were also calculated for each attribute.
Box plots, histograms, violin plots for density curves, and scatter plots of each independent variable against the target column were also produced.
Outliers were detected with the help of box plots for the various attributes.
Mixtures of Gaussians were checked in the pair plot; the probable clusters will be built in a subsequent iteration using unsupervised learning.
The presence of outliers will also be addressed in subsequent iterations to enhance model performance.
df_scaled = df.apply(zscore)
# df.apply already returns a dataframe; this line just makes the column labels explicit
df_scaled = pd.DataFrame(df_scaled, columns = df.columns)
df_scaled.head()
X_scaled = df_scaled.drop('strength', axis=1)
y_scaled = df_scaled[['strength']]
# Split X and y into training and test set in 70:30 ratio
X_train_scaled, X_test_scaled, y_train_scaled, y_test_scaled = train_test_split(X_scaled, y_scaled, test_size=0.30, random_state=1)
# checking the split of data
print('{0:0.2f}% data is in training set'.format((len(X_train_scaled)/len(df.index))*100))
print('{0:0.2f}% data is in testing set'.format((len(X_test_scaled)/len(df.index))*100))
lr_model = LinearRegression()
lr_model.fit(X_train_scaled, y_train_scaled)
# Let us explore the coefficients for each of the independent attributes
for idx, col_name in enumerate(X_train_scaled.columns):
    print("The coefficient for {} is {}".format(col_name, lr_model.coef_[0][idx]))
intercept = lr_model.intercept_[0]
print("The intercept for our model is {}".format(intercept))
# Model score - R^2 or coefficient of determination
# R^2 = 1 - RSS/TSS
lr_model_score_train = lr_model.score(X_train_scaled, y_train_scaled)
print('Training model Accuracy value: {0:0.2f}%'.format(lr_model_score_train*100))
lr_model_score_test = lr_model.score(X_test_scaled, y_test_scaled)
print('Testing Model Accuracy value: {0:0.2f}%'.format(lr_model_score_test*100))
# lr_model.score(X_test_scaled, y_test_scaled)
# Is OLS a good model ? Should we be building a simple linear model ? Check the residuals for each predictor.
fig = plt.figure(figsize=(10,5))
sns.residplot(x= X_test_scaled['cement'], y= y_test_scaled['strength'], color='green', lowess=True )
fig = plt.figure(figsize=(10,5))
sns.residplot(x= X_test_scaled['superplastic'], y= y_test_scaled['strength'], color='red', lowess=True );
fig = plt.figure(figsize=(10,5))
sns.residplot(x= X_test_scaled['age'], y= y_test_scaled['strength'], color='blue', lowess=True );
fig = plt.figure(figsize=(10,5))
sns.residplot(x= X_test_scaled['slag'], y= y_test_scaled['strength'], color='violet', lowess=True );
fig = plt.figure(figsize=(10,5))
sns.residplot(x= X_test_scaled['ash'], y= y_test_scaled['strength'], color='purple', lowess=True );
fig = plt.figure(figsize=(10,5))
sns.residplot(x= X_test_scaled['coarseagg'], y= y_test_scaled['strength'], color='orange', lowess=True );
fig = plt.figure(figsize=(10,5))
sns.residplot(x= X_test_scaled['fineagg'], y= y_test_scaled['strength'], color='brown', lowess=True );
The term 'linear' can be interpreted in two ways:
Linearity in the variables: the independent variables x are raised to the power 1. Linearity in the parameters: the coefficients are raised to the power 1, while x can be raised to any power.
So here, a linear model means linear in the parameters: the coefficients appear to the first power, but x can be of any power.
The complexity of the model also depends on the number of dimensions included in building it; as the number of dimensions increases, so does the complexity.
In the residual plots above, a single peak and valley suggests the model is fairly linear and simple; with more than one peak and valley, a quadratic model would be indicated instead.
The stochastic disturbance, or stochastic error term, is caused by independent variables not taken into consideration, disturbances that affect the target variable.
So from the graphical representation above and the fundamentals of linear regression, it can be concluded that a simple linear model is appropriate.
We will check this further with the Ridge and Lasso models below before reaching a conclusion.
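'Linear in the parameters' can be made concrete with `PolynomialFeatures` (imported earlier but not yet used): the inputs are expanded with powers of x, while `LinearRegression` stays linear in its coefficients. A toy sketch with a known quadratic relationship:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

# a perfectly quadratic relationship: y = 1 + 2x + 3x^2
x = np.arange(6, dtype=float).reshape(-1, 1)
y = 1 + 2 * x[:, 0] + 3 * x[:, 0] ** 2

# degree-2 expansion adds x^2 as a column; the regression itself
# remains linear in its parameters
X_poly = PolynomialFeatures(degree=2, include_bias=False).fit_transform(x)
model = LinearRegression().fit(X_poly, y)

print(model.intercept_, model.coef_)  # ≈ 1.0 and ≈ [2.0, 3.0]
```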
So the model explains about 64% of the variability in y using X.
R^2 is not a fully reliable metric, as it always increases when more attributes are added, even if they have no influence on the predicted variable. Instead we can use adjusted R^2, which corrects for this statistical chance.
Scikit-learn does not provide adjusted R^2 directly, so we will use statsmodels, a library that gives results similar to those obtained in R.
With the formula API, statsmodels expects X and y in a single dataframe.
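For reference, adjusted R² can also be computed by hand from R², the sample size n and the number of predictors p; the example numbers below are illustrative, not results from this dataset:

```python
def adjusted_r2(r2: float, n: int, p: int) -> float:
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# e.g. an R^2 of 0.64 with 700 training rows and 8 predictors (illustrative)
print(round(adjusted_r2(0.64, 700, 8), 4))  # → 0.6358
```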
train_data = pd.concat([X_train_scaled, y_train_scaled], axis=1)
train_data.head()
import statsmodels.formula.api as smf
lm1 = smf.ols(formula = 'strength ~ cement+slag+ash+water+superplastic+coarseagg+fineagg+age', data = train_data).fit()
lm1.params
print(lm1.summary()) # Inferential Statistics
# Let us check the mean squared error by predicting y for the test cases and
# subtracting from the actual y for those cases
mse = np.mean((lr_model.predict(X_test_scaled)-y_test_scaled)**2)
# the square root of the MSE is the RMSE, i.e. the typical deviation between predicted and actual
math.sqrt(mse)
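The same quantity is also available through sklearn's metrics module; a toy cross-check of the manual formula (synthetic arrays, since the fitted model above depends on the CSV):

```python
import numpy as np
from sklearn.metrics import mean_squared_error

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

# RMSE computed by hand matches the square root of sklearn's MSE
rmse_manual = np.sqrt(np.mean((y_pred - y_true) ** 2))
rmse_sklearn = np.sqrt(mean_squared_error(y_true, y_pred))

assert np.isclose(rmse_manual, rmse_sklearn)
print(rmse_manual)
```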
print('Training model Accuracy value: {0:0.2f}%'.format(lr_model_score_train*100))
print('Testing Model Accuracy value: {0:0.2f}%'.format(lr_model_score_test*100))
y_pred_scaled = lr_model.predict(X_test_scaled)
sns.set(style = 'darkgrid', color_codes = True)
with sns.axes_style('white'):
    sns.jointplot(x = y_test_scaled['strength'], y = np.ravel(y_pred_scaled), kind = 'reg', color = 'red')
The scatter plot above is drawn between the actual and predicted target values.
Although the trend is linear, the points scattered far from the ideal diagonal indicate the model's inability to predict the target perfectly.
Let us build regularised models to predict the target feature.
ridge = Ridge(alpha=.3) # lambda / alpha hyperparameter value = 0.3
ridge.fit(X_train_scaled,y_train_scaled)
print ("Ridge model:", (ridge.coef_))
print('Training model Accuracy value: {0:0.2f}%'.format((ridge.score(X_train_scaled,y_train_scaled))*100))
print('Testing Model Accuracy value: {0:0.2f}%'.format((ridge.score(X_test_scaled, y_test_scaled))*100))
y_pred_scaled_ridge = ridge.predict(X_test_scaled)
sns.set(style = 'darkgrid', color_codes = True)
with sns.axes_style('white'):
    sns.jointplot(x = y_test_scaled, y = pd.DataFrame(y_pred_scaled_ridge), kind = 'reg', color = 'green')
# Compute the mean squared error by predicting y for the test cases
# and comparing with the actual y for the test cases
mse2 = np.mean((ridge.predict(X_test_scaled)-y_test_scaled)**2)
# the square root of the MSE is the RMSE, i.e. the typical deviation between predicted and actual values
math.sqrt(mse2)
lasso = Lasso(alpha=0.1) # Lambda or alpha, the regularisation hyperparameter, = 0.1
lasso.fit(X_train_scaled,y_train_scaled)
print ("Lasso model:", (lasso.coef_))
print('Training model Accuracy value: {0:0.2f}%'.format((lasso.score(X_train_scaled,y_train_scaled))*100))
print('Testing Model Accuracy value: {0:0.2f}%'.format((lasso.score(X_test_scaled, y_test_scaled))*100))
y_pred_scaled_lasso = lasso.predict(X_test_scaled)
sns.set(style = 'darkgrid', color_codes = True)
with sns.axes_style('white'):
    sns.jointplot(x = y_test_scaled, y = pd.DataFrame(y_pred_scaled_lasso), kind = 'reg', color = 'teal')
Continuing the discussion above, we have built two regularised linear models.
Both models minimise the SSE (sum of squared errors) while also penalising model complexity.
The ridge model shrinks coefficients towards zero by adding an L2 penalty on their magnitudes to the cost function of the un-regularised linear model.
By doing so, the coefficients of some variables are pulled close to zero, which keeps the model simple and prevents overfitting.
The lasso model, unlike ridge, uses an L1 penalty that can set some coefficients exactly to zero, effectively dropping those variables.
This ensures the corresponding variables are removed from the model entirely, which makes the model much simpler and also prevents overfitting; apart from the penalty term, the cost function matches that of the non-regularised linear model.
From this discussion we conclude that, for this dataset, we should build simple linear models and avoid higher-degree or quadratic terms in the parameters.
The scatter plots above are drawn between the actual and predicted target feature for both the regularised ridge and lasso models.
As before, although the points vary roughly linearly, their spread around a virtual straight line indicates that neither model predicts the target feature perfectly.
The regularised models thus fail to improve the score and prediction capability over the un-regularised model; their scores are nearly the same.
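The shrink-versus-drop behaviour described above can be seen on synthetic data (generated here for illustration, not our concrete data): with only 3 informative features out of 8, lasso zeroes out coefficients of the uninformative ones, while ridge merely shrinks everything.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# 8 features, only 3 of which actually drive the target
X_demo, y_demo = make_regression(n_samples=200, n_features=8,
                                 n_informative=3, noise=10.0, random_state=0)
ridge_demo = Ridge(alpha=1.0).fit(X_demo, y_demo)
lasso_demo = Lasso(alpha=5.0).fit(X_demo, y_demo)

zeros_ridge = int(np.sum(ridge_demo.coef_ == 0))            # ridge keeps all coefficients non-zero
zeros_lasso = int(np.sum(np.abs(lasso_demo.coef_) < 1e-8))  # lasso drops some uninformative features
print(zeros_ridge, zeros_lasso)
```

The alpha values here are illustrative; larger alphas shrink (ridge) or zero out (lasso) more aggressively.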
Let us try to improve the scores by generating polynomial features for the linear models, which can capture non-linear interactions between dimensions that are significantly correlated with one another.
poly = PolynomialFeatures(degree = 2, interaction_only = True)
# degree = 2: allow polynomial terms up to power 2 of the existing columns
# interaction_only = True: only interaction (cross-product) terms are created, not pure powers such as x^2
X_poly= poly.fit_transform(X_scaled)
X_train_p, X_test_p, y_train_p, y_test_p = train_test_split(X_poly, y_scaled, test_size=0.30, random_state= 10248)
X_train_p.shape
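As a sanity check on the expansion size: with 8 predictors, degree 2 and interaction_only=True produce 1 bias term + 8 linear terms + C(8,2) = 28 pairwise products, i.e. 37 columns. A sketch on a toy array:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X_toy = np.arange(16).reshape(2, 8).astype(float)  # 2 rows, 8 features
poly_demo = PolynomialFeatures(degree=2, interaction_only=True)
print(poly_demo.fit_transform(X_toy).shape)  # (2, 37)
```

Without interaction_only, the 8 squared terms would also appear, giving 45 columns.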
lr_model.fit(X_train_p, y_train_p)
print(lr_model.coef_[0])
# Compute the mean squared error by predicting y for the test cases
# and comparing with the actual y for the test cases
mse11 = np.mean((lr_model.predict(X_test_p)-y_test_p)**2)
# the square root of the MSE is the RMSE, i.e. the typical deviation between predicted and actual values
math.sqrt(mse11)
ridge = Ridge(alpha =0.3)
ridge.fit(X_train_p, y_train_p)
print('Ridge model:', (ridge.coef_))
# Compute the mean squared error by predicting y for the test cases
# and comparing with the actual y for the test cases
mse22 = np.mean((ridge.predict(X_test_p)-y_test_p)**2)
# the square root of the MSE is the RMSE, i.e. the typical deviation between predicted and actual values
math.sqrt(mse22)
lasso = Lasso(alpha =0.2)
lasso.fit(X_train_p, y_train_p)
print('Lasso model:', (lasso.coef_))
# Compute the mean squared error by predicting y for the test cases
# and comparing with the actual y for the test cases
mse33 = np.mean(((lasso.predict(X_test_p)).reshape(-1,1) - y_test_p)**2)
# the square root of the MSE is the RMSE, i.e. the typical deviation between predicted and actual values
math.sqrt(mse33)
print('Training model Accuracy value for Linear Model: {0:0.2f}%'.format((lr_model.score(X_train_p, y_train_p))*100))
print('Testing Model Accuracy value for Linear Model: {0:0.2f}%'.format((lr_model.score(X_test_p, y_test_p))*100))
y_pred_scaled_lr = lr_model.predict(X_test_p)
sns.set(style = 'darkgrid', color_codes = True)
with sns.axes_style('white'):
    sns.jointplot(x = y_test_p, y = pd.DataFrame(y_pred_scaled_lr), kind = 'reg', color = 'green')
print('Training model Accuracy value for Ridge Model: {0:0.2f}%'.format((ridge.score(X_train_p, y_train_p))*100))
print('Testing Model Accuracy value for Ridge Model: {0:0.2f}%'.format((ridge.score(X_test_p, y_test_p))*100))
y_pred_scaled_ridge_p = ridge.predict(X_test_p)
sns.set(style = 'darkgrid', color_codes = True)
with sns.axes_style('white'):
    sns.jointplot(x = y_test_p, y = pd.DataFrame(y_pred_scaled_ridge_p), kind = 'reg', color = 'r')
print('Training model Accuracy value for Lasso Model: {0:0.2f}%'.format((lasso.score(X_train_p, y_train_p))*100))
print('Testing Model Accuracy value for Lasso Model: {0:0.2f}%'.format((lasso.score(X_test_p, y_test_p))*100))
y_pred_scaled_lasso_p = lasso.predict(X_test_p)
sns.set(style = 'darkgrid', color_codes = True)
with sns.axes_style('white'):
    sns.jointplot(x = y_test_p, y = pd.DataFrame(y_pred_scaled_lasso_p), kind = 'reg', color = 'violet')
Linear regression and ridge give more or less similar results, but lasso has a much lower score because it retains only four polynomial features, yielding a less complex model (complexity being a function of the number of variables and the size of their coefficients).
For ridge and the un-regularised model, by contrast, the retained dimensions are almost the same.
# Ways of dropping variables:
# - significance of variables (p-values)
# - VIF (variance inflation factor)
# Computing VIF: VIF_j = 1 / (1 - R^2_j), where R^2_j comes from regressing feature j on the other features
# Create a dataframe containing all the features and their respective VIF values
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor
vif=pd.DataFrame()
vif['Features'] = X_train_scaled.columns
vif['VIF'] = [variance_inflation_factor(X_train_scaled.values,i) for i in range(X_train_scaled.shape[1]) ]
vif['VIF'] = round(vif['VIF'],2)
vif = vif.sort_values(by = 'VIF', ascending = False)
vif
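The VIF formula above can be verified by hand: regress each feature on the others and apply VIF_j = 1/(1 − R²_j). A minimal sketch on synthetic data (the helper `vif_manual` and the column names a, b, c are ours):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

def vif_manual(X: pd.DataFrame) -> pd.Series:
    # VIF_j = 1 / (1 - R^2_j), R^2_j from regressing column j on the remaining columns
    scores = {}
    for col in X.columns:
        rest = X.drop(columns=col)
        r2 = LinearRegression().fit(rest, X[col]).score(rest, X[col])
        scores[col] = 1.0 / (1.0 - r2)
    return pd.Series(scores)

rng = np.random.default_rng(0)
a = rng.normal(size=300)
b = rng.normal(size=300)
demo = pd.DataFrame({'a': a, 'b': b, 'c': a + 0.1 * rng.normal(size=300)})
vifs = vif_manual(demo)
print(vifs.round(1))  # a and c are collinear, so both get large VIFs; b stays near 1
```

A common rule of thumb treats VIF above 5 (or 10) as a sign of problematic multicollinearity.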
None of the columns has a very high VIF value, so we are not required to drop any of them. However, some composite columns can be formed to check the model performance.
Based on the VIF scores of ash and slag, their correlation with the target column, and their role in strengthening the concrete, we can merge these two columns into a composite column.
dtr_model = DecisionTreeRegressor(random_state=0, max_depth=3)
dtr_model.fit(X_train_scaled, y_train_scaled)
y_pred = dtr_model.predict(X_test_scaled)
print('Training model Accuracy value for Decision Tree Regressor Model: {0:0.2f}%'.format((dtr_model.score(X_train_scaled, y_train_scaled))*100))
print('Testing Model Accuracy value for Decision Tree Regressor Model: {0:0.2f}%'.format((dtr_model.score(X_test_scaled, y_test_scaled))*100))
sns.set(style = 'darkgrid', color_codes = True)
with sns.axes_style('dark'):
    sns.jointplot(x = y_test_scaled, y=pd.DataFrame(y_pred), kind = 'reg', color = 'purple')
# Compute the mean squared error by predicting y for the test cases
# and comparing with the actual y for the test cases
mse4 = np.mean((y_pred.reshape(-1,1) - y_test_scaled)**2)
# the square root of the MSE is the RMSE, i.e. the typical deviation between predicted and actual values
math.sqrt(mse4)
feature_importances = dtr_model.feature_importances_
feature_names = df.columns[0:8]
print(feature_names)
k=8
print(feature_importances)
top_k_idx = (feature_importances.argsort()[-k:][::-1])
print(feature_names[top_k_idx], feature_importances)
Among all features, cement, water and age are the most important in the decision tree regression model.
These columns play an important role in determining the model performance.
Here we got a model performance score of about 58%, which is quite low. We will explore further to enhance the model performance.
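One plausible way to lift a shallow tree's score is to tune max_depth with cross-validated grid search rather than fixing it at 3; a hedged sketch on synthetic regression data (not the concrete dataset):

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

X_gs, y_gs = make_regression(n_samples=400, n_features=8, noise=10.0, random_state=0)
grid = {'max_depth': [3, 5, 7, 9]}  # candidate depths to try
search = GridSearchCV(DecisionTreeRegressor(random_state=0), grid, cv=5)
search.fit(X_gs, y_gs)
print(search.best_params_)
```

The cross-validated best depth balances underfitting (too shallow) against overfitting (too deep).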
from sklearn import svm
clr = svm.SVR()
clr.fit(X_train_scaled, y_train_scaled)
print('Training model Accuracy value for Support Vector Regressor Model: {0:0.2f}%'.format((clr.score(X_train_scaled, y_train_scaled))*100))
print('Testing Model Accuracy value for Support Vector Regressor Model: {0:0.2f}%'.format((clr.score(X_test_scaled, y_test_scaled))*100))
y_pred_clr = clr.predict(X_test_scaled)
sns.set(style = 'darkgrid', color_codes = True)
with sns.axes_style('white'):
    sns.jointplot(x = y_test_scaled, y = pd.DataFrame(y_pred_clr), kind = 'reg', color = 'green')
# Compute the mean squared error by predicting y for the test cases
# and comparing with the actual y for the test cases
mse5 = np.mean(((clr.predict(X_test_scaled)).reshape(-1,1) - y_test_scaled)**2)
# the square root of the MSE is the RMSE, i.e. the typical deviation between predicted and actual values
math.sqrt(mse5)
Here, we got a good score of about 82%, the best so far.
Let's explore a bit more to enhance the model performance.
rfr = RandomForestRegressor(n_estimators = 50, random_state = 559, max_features = 8 )
rfr = rfr.fit(X_train_scaled, y_train_scaled)
print('Training model Accuracy value for Random Forest Regressor Model: {0:0.2f}%'.format((rfr.score(X_train_scaled, y_train_scaled))*100))
print('Testing Model Accuracy value for Random Forest Regressor Model: {0:0.2f}%'.format((rfr.score(X_test_scaled, y_test_scaled))*100))
y_predict_rfr = rfr.predict(X_test_scaled)
sns.set(style = 'darkgrid', color_codes = True)
with sns.axes_style('white'):
    sns.jointplot(x = y_test_scaled, y = pd.DataFrame(y_predict_rfr), kind = 'reg', color = 'brown')
# Compute the mean squared error by predicting y for the test cases
# and comparing with the actual y for the test cases
mse6 = np.mean(((rfr.predict(X_test_scaled)).reshape(-1,1) - y_test_scaled)**2)
# the square root of the MSE is the RMSE, i.e. the typical deviation between predicted and actual values
math.sqrt(mse6)
Here, we got a very good score of 90.36%, the best so far.
Let's explore a bit more to enhance the model performance.
xgbr = xgboost.XGBRegressor()
xgbr.fit(X_train_scaled, y_train_scaled)
# Predicting the Test set results
print('Training model Accuracy value for XGBoost Regressor Model: {0:0.2f}%'.format((xgbr.score(X_train_scaled, y_train_scaled))*100))
print('Testing Model Accuracy value for XGBoost Regressor Model: {0:0.2f}%'.format((xgbr.score(X_test_scaled, y_test_scaled))*100))
y_pred_xgbr = xgbr.predict(X_test_scaled)
sns.set(style = 'darkgrid', color_codes = True)
with sns.axes_style('white'):
    sns.jointplot(x = y_test_scaled, y = pd.DataFrame(y_pred_xgbr), kind = 'reg', color = 'orange')
# Compute the mean squared error by predicting y for the test cases
# and comparing with the actual y for the test cases
mse7 = np.mean(((xgbr.predict(X_test_scaled)).reshape(-1,1) - y_test_scaled)**2)
# the square root of the MSE is the RMSE, i.e. the typical deviation between predicted and actual values
math.sqrt(mse7)
Here, we got a very good score of 92.20%, the best among all the models so far.
Let's explore a bit more to enhance the model performance.
gbr = GradientBoostingRegressor(n_estimators = 50, random_state = 559, max_features = 8 )
gbr.fit(X_train_scaled, y_train_scaled)
# Predicting the Test set results
print('Training model Accuracy value for Gradient Boost Regressor Model: {0:0.2f}%'.format((gbr.score(X_train_scaled, y_train_scaled))*100))
print('Testing Model Accuracy value for Gradient Boost Regressor Model: {0:0.2f}%'.format((gbr.score(X_test_scaled, y_test_scaled))*100))
y_pred_gbr = gbr.predict(X_test_scaled)
sns.set(style = 'darkgrid', color_codes = True)
with sns.axes_style('white'):
    sns.jointplot(x = y_test_scaled, y = pd.DataFrame(y_pred_gbr), kind = 'reg', color = 'blue')
# Compute the mean squared error by predicting y for the test cases
# and comparing with the actual y for the test cases
mse8 = np.mean(((gbr.predict(X_test_scaled)).reshape(-1,1) - y_test_scaled)**2)
# the square root of the MSE is the RMSE, i.e. the typical deviation between predicted and actual values
math.sqrt(mse8)
Here, we got a good score of 86.61%, better than most of the models but still below XGBoost.
Let's explore a bit more to enhance the model performance.
req_col_names = ['cement', 'slag', 'ash', 'water', 'superplastic', 'coarseagg',
'fineagg', 'age', 'strength']
feature_dtr = dtr_model.feature_importances_
feature_rfr = rfr.feature_importances_
labels = req_col_names[:-1]
x = np.arange(len(labels))
width = 0.3
fig, ax = plt.subplots(figsize=(10,6))
rects1 = ax.bar(x-(width/2), feature_dtr, width, label='Decision Tree')
rects2 = ax.bar(x+(width/2), feature_rfr, width, label='Random Forest')
ax.set_ylabel('Importance')
ax.set_xlabel('Features')
ax.set_title('Feature Importance')
ax.set_xticks(x)
ax.set_xticklabels(labels, rotation=45)
ax.legend(loc="upper left", bbox_to_anchor=(1,1))
def autolabel(rects):
    """Attach a text label above each bar in *rects*, displaying its height."""
    for rect in rects:
        height = rect.get_height()
        ax.annotate('{:.2f}'.format(height), xy=(rect.get_x() + rect.get_width() / 2, height),
                    xytext=(0, 3), textcoords="offset points", ha='center', va='bottom')

autolabel(rects1)
autolabel(rects2)
fig.tight_layout()
plt.show()
Based on the importance values from the tree-based models (decision tree and random forest), cement and age are the most important columns.
Water is also important according to the importance values given by both tree-based models.
These three features therefore dominate the model score and the minimisation of the cost function.
The remaining columns still contribute to the model score, but much less.
According to the decision tree alone, only cement, water and age are important, and the rest of the columns could be dropped from the calculation; this is what keeps the model simpler and prevents overfitting.
The random forest model assigns some importance to every column, but for most of them it is almost negligible.
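Impurity-based importances like these can be cross-checked with permutation importance, which measures how much the test score drops when a feature's values are shuffled; a sketch on synthetic data (not the concrete dataset):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X_pi, y_pi = make_regression(n_samples=300, n_features=8, n_informative=3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_pi, y_pi, random_state=0)
rf_pi = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_tr, y_tr)
# shuffle each feature 5 times on the held-out set and average the score drop
result = permutation_importance(rf_pi, X_te, y_te, n_repeats=5, random_state=0)
print(result.importances_mean.shape)  # one mean importance per feature
```

Permutation importance is computed on held-out data, so it is less biased towards high-cardinality features than impurity-based importance.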
req_col_names = ['cement', 'slag', 'ash', 'water', 'superplastic', 'coarseagg',
'fineagg', 'age', 'strength']
feature_xgbr = xgbr.feature_importances_
feature_gbr = gbr.feature_importances_
labels = req_col_names[:-1]
x = np.arange(len(labels))
width = 0.3
fig, ax = plt.subplots(figsize=(10,6))
rects1 = ax.bar(x-(width/2), feature_xgbr, width, label='XGBoost Regressor')
rects2 = ax.bar(x+(width/2), feature_gbr, width, label='Gradient Boost Regressor')
ax.set_ylabel('Importance')
ax.set_xlabel('Features')
ax.set_title('Feature Importance')
ax.set_xticks(x)
ax.set_xticklabels(labels, rotation=45)
ax.legend(loc="upper left", bbox_to_anchor=(1,1))
def autolabel(rects):
    """Attach a text label above each bar in *rects*, displaying its height."""
    for rect in rects:
        height = rect.get_height()
        ax.annotate('{:.2f}'.format(height), xy=(rect.get_x() + rect.get_width() / 2, height),
                    xytext=(0, 3), textcoords="offset points", ha='center', va='bottom')

autolabel(rects1)
autolabel(rects2)
fig.tight_layout()
plt.show()
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
num_folds = 10
seed = 235
kfold = KFold(n_splits = num_folds, shuffle = True, random_state = seed)  # random_state requires shuffle=True
result_lr = cross_val_score(lr_model, X_scaled, y_scaled, cv = kfold)
print(result_lr)
print('\n')
print('Model Score in iteration 1 for Linear Regression:%.3f%% (%.3f%%)'%(result_lr.mean()*100.0, result_lr.std()*100.0))
print('\n')
result_ridge = cross_val_score(ridge, X_scaled, y_scaled, cv = kfold)
print(result_ridge)
print('\n')
print('Model Score in iteration 1 for Ridge:%.3f%% (%.3f%%)'%(result_ridge.mean()*100.0, result_ridge.std()*100.0))
print('\n')
result_lasso = cross_val_score(lasso, X_scaled, y_scaled, cv = kfold)
print(result_lasso)
print('\n')
print('Model Score in iteration 1 for Lasso:%.3f%% (%.3f%%)'%(result_lasso.mean()*100.0, result_lasso.std()*100.0))
print('\n')
result_DTR = cross_val_score(dtr_model, X_scaled, y_scaled, cv = kfold)
print(result_DTR)
print('\n')
print('Model Score in iteration 1 for Decision Tree Regressor:%.3f%% (%.3f%%)'%(result_DTR.mean()*100.0, result_DTR.std()*100.0))
print('\n')
result_clr = cross_val_score(clr, X_scaled, y_scaled, cv = kfold)
print(result_clr)
print('\n')
print('Model Score in iteration 1 for Support Vector Regressor:%.3f%% (%.3f%%)'%(result_clr.mean()*100.0, result_clr.std()*100.0))
print('\n')
result_rfr = cross_val_score(rfr, X_scaled, y_scaled, cv = kfold)
print(result_rfr)
print('\n')
print('Model Score in iteration 1 for Random Forest Regressor:%.3f%% (%.3f%%)'%(result_rfr.mean()*100.0, result_rfr.std()*100.0))
print('\n')
result_xgbr = cross_val_score(xgbr, X_scaled, y_scaled, cv = kfold)
print(result_xgbr)
print('\n')
print('Model Score in iteration 1 for XGBoost Regressor:%.3f%% (%.3f%%)'%(result_xgbr.mean()*100.0, result_xgbr.std()*100.0))
print('\n')
result_gbr = cross_val_score(gbr, X_scaled, y_scaled, cv = kfold)
print(result_gbr)
print('\n')
print('Model Score in iteration 1 for Gradient Boosting Regressor:%.3f%% (%.3f%%)'%(result_gbr.mean()*100.0, result_gbr.std()*100.0))
result_itteration1 = pd.DataFrame({'Algorithm' : ['Linear Regression', 'Ridge', 'Lasso', 'Decision Tree', 'Support Vector',
'Random Forest', 'XGBoost', 'GradientBoost'],
'Model_score': [lr_model.score(X_test_p, y_test_p)*100.0, ridge.score(X_test_p, y_test_p)*100.0,
lasso.score(X_test_p, y_test_p)*100.0, dtr_model.score(X_test_scaled, y_test_scaled)*100.0,
clr.score(X_test_scaled, y_test_scaled)*100.0, rfr.score(X_test_scaled, y_test_scaled)*100.0,
xgbr.score(X_test_scaled, y_test_scaled)*100.0, gbr.score(X_test_scaled, y_test_scaled)*100.0],
'Root Mean Square Error' : [math.sqrt(mse11), math.sqrt(mse22), math.sqrt(mse33),
math.sqrt(mse4), math.sqrt(mse5), math.sqrt(mse6),
math.sqrt(mse7), math.sqrt(mse8)],
'Cross_val_score' : [result_lr.mean()*100.0, result_ridge.mean()*100.0, result_lasso.mean()*100.0,
result_DTR.mean()*100.0, result_clr.mean()*100.0, result_rfr.mean()*100.0,
result_xgbr.mean()*100.0, result_gbr.mean()*100.0],
'Std_Dev':[result_lr.std()*100.0, result_ridge.std()*100.0, result_lasso.std()*100.0,
result_DTR.std()*100.0, result_clr.std()*100.0, result_rfr.std()*100.0,
result_xgbr.std()*100.0, result_gbr.std()*100.0]})
result_itteration1
In linear regression analysis, the cost function is based on the root mean square error (RMSE), which we aim to minimise. The model with the minimum RMSE value is the one to prefer; here XGBoost has the lowest RMSE and can be chosen for further analysis. The model score also plays an important role in deciding on the most accurate model.
Among all models, XGBoost has the maximum score with the train_test_split method,
and likewise the maximum cross-validation score, with a standard deviation of 2.49.
The cross-validation scores of the linear regression, ridge and lasso models are lower than their ordinary scores because the ordinary scores were computed on the polynomial features.
Without the polynomial features, these three models give scores almost identical to their cross-validation scores.
We will try to explore the scores further and enhance model performance by addressing the outliers present in the features.
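cross_val_score can also report RMSE per fold directly via the scoring argument (available in recent scikit-learn versions), which avoids computing MSE by hand; a sketch on synthetic data:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

X_cv, y_cv = make_regression(n_samples=200, n_features=8, noise=15.0, random_state=42)
kf = KFold(n_splits=10, shuffle=True, random_state=235)
# the scorer is negated so that higher is better; flip the sign to get RMSE
rmse_folds = -cross_val_score(LinearRegression(), X_cv, y_cv,
                              cv=kf, scoring='neg_root_mean_squared_error')
print(rmse_folds.shape)  # one RMSE value per fold
```

This makes the fold-wise RMSE comparable to the manually computed math.sqrt(mse) values used above.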
The data appear to be a mix of Gaussians; from visual inspection of the pair panel we can expect 4 to 5 clusters in this dataset.
We will therefore restrict the number of clusters to the range 3 to 11.
To build the clusters we will use the K-Means clustering technique from unsupervised learning.
cluster_range = range (3,12)
cluster_errors = []
for num_clusters in cluster_range:
    clusters = KMeans(n_clusters = num_clusters, n_init = 5)
    clusters.fit(df)
    labels = clusters.labels_
    centroids = clusters.cluster_centers_
    cluster_errors.append(clusters.inertia_)
clusters_df = pd.DataFrame({'num_clusters':cluster_range, 'cluster_errors': cluster_errors})
clusters_df[0:30]
The table above shows the cluster error (inertia) for each number of clusters.
Let's draw the elbow plot to choose the number of clusters.
# Elbow Plot
plt.figure(figsize = (12,6))
plt.plot(clusters_df.num_clusters, clusters_df.cluster_errors, marker = 'o');
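Besides the elbow, the silhouette score gives another check on the number of clusters: it peaks when clusters are compact and well separated. A sketch on synthetic blobs (not the concrete data):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X_blobs, _ = make_blobs(n_samples=300, centers=4, random_state=0)
# silhouette score for each candidate k
sil = {k: silhouette_score(X_blobs,
                           KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_blobs))
       for k in range(3, 9)}
best_k = max(sil, key=sil.get)
print(best_k)
```

Silhouette values lie in [-1, 1]; the k with the highest value is preferred, which usually agrees with the elbow when the elbow is sharp.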
cluster = KMeans(n_clusters = 8, random_state = 2535)
cluster.fit(df_scaled)
prediction = cluster.predict(df_scaled)
# Creating a new column 'GROUP' which will hold the cluster id of each record
df_scaled['GROUP'] = prediction
# Creating a mirror copy for later re-use instead of building repeatedly
df_scaled_copy = df_scaled.copy(deep = True)
centroids = cluster.cluster_centers_
centroids
df_scaled.boxplot(by = 'GROUP', layout = (3,4), figsize = (15,20), color = 'orange');
# Addressing the outliers at group level
def replace(group):
    median, std = group.median(), group.std()    # group median and standard deviation
    outliers = (group - median).abs() > 2 * std  # flag members more than 2 std. dev. from the median
    group[outliers] = median                     # replace flagged outliers with the group median
    return group
df_corrected = (df_scaled.groupby('GROUP').transform(replace))
concat_data = df_corrected.join(pd.DataFrame(df_scaled['GROUP']))
concat_data.boxplot(by = 'GROUP', layout = (3,4), figsize = (15,20), color = 'blue');
NOTE: When we replace outliers with the median or mean, the distribution shape changes and the standard deviation becomes tighter, creating new outliers. These new outliers lie much closer to the centre than the original ones, so we accept them without further modification.
In other words, by replacing outliers we forcefully created some new ones by tightening the boundary.
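The group-wise replacement above can be verified on a tiny frame: within group 0 the value 100.0 lies more than two standard deviations from the group median, so it is replaced by that median, while everything else is untouched (the column names GROUP and x are illustrative):

```python
import pandas as pd

def replace(group):
    median, std = group.median(), group.std()
    outliers = (group - median).abs() > 2 * std
    group = group.copy()          # avoid mutating the original
    group[outliers] = median
    return group

demo = pd.DataFrame({'GROUP': [0] * 5 + [1] * 5,
                     'x': [1.0, 2.0, 3.0, 2.0, 100.0,
                           10.0, 11.0, 12.0, 11.0, 10.0]})
cleaned = demo.groupby('GROUP')['x'].transform(replace)
print(cleaned.tolist())  # 100.0 becomes 2.0 (the group-0 median); the rest are unchanged
```

Because each group uses its own median and spread, a value that is extreme overall but typical for its cluster is left alone.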
with sns.axes_style('white'):
    plot = sns.lmplot(x = 'cement', y = 'strength', data = concat_data, hue = 'GROUP')
    plot.set(ylim = (-3,3));
plt.title('Cement vs. Strength', color = 'brown');
with sns.axes_style('white'):
    plot = sns.lmplot(x = 'ash', y = 'strength', data = concat_data, hue = 'GROUP')
    plot.set(ylim = (-3,3));
plt.title('Ash vs. Strength', color = 'brown');
with sns.axes_style('white'):
    plot = sns.lmplot(x = 'water', y = 'strength', data = concat_data, hue = 'GROUP')
    plot.set(ylim = (-3,3));
plt.title('Water vs. Strength', color = 'brown');
with sns.axes_style('white'):
    plot = sns.lmplot(x = 'slag', y = 'strength', data = concat_data, hue = 'GROUP')
    plot.set(ylim = (-3,3));
plt.title('Slag vs. Strength', color = 'brown');
with sns.axes_style('white'):
    plot = sns.lmplot(x = 'age', y = 'strength', data = concat_data, hue = 'GROUP')
    plot.set(ylim = (-3,3));
plt.title('Age vs. Strength', color = 'brown');
with sns.axes_style('white'):
    plot = sns.lmplot(x = 'fineagg', y = 'strength', data = concat_data, hue = 'GROUP')
    plot.set(ylim = (-3,3));
plt.title('Fine Aggregate vs. Strength', color = 'brown');
with sns.axes_style('white'):
    plot = sns.lmplot(x = 'coarseagg', y = 'strength', data = concat_data, hue = 'GROUP')
    plot.set(ylim = (-3,3));
plt.title('Coarse Aggregate vs. Strength', color = 'brown');
with sns.axes_style('white'):
    plot = sns.lmplot(x = 'superplastic', y = 'strength', data = concat_data, hue = 'GROUP')
    plot.set(ylim = (-3,3));
plt.title('Super Plastic vs. Strength', color = 'brown');
The plots above show the relation of the target column with each independent column, cluster-wise (group-wise).
For each independent variable, the clusters are plotted with their respective best-fit lines.
The more horizontal a line is, the weaker it is at predicting the target column: a horizontal line effectively assigns a single fixed value of the target, so it cannot capture the spread of the data points.
We can observe from the plots that, in each cluster, the data are spread far from the best-fit line, signifying a large variance within the cluster.
df_corrected.head()
X_scaled_corre = df_corrected.drop('strength', axis =1)
y_scaled_corre = df_corrected[['strength']]
# Split X and y into training and test set in 70:30 ratio
X_train_corre, X_test_corre, y_train_corre, y_test_corre = train_test_split(X_scaled_corre, y_scaled_corre, test_size=0.30, random_state=1)
lr_model.fit(X_train_corre, y_train_corre)
ridge.fit(X_train_corre,y_train_corre)
lasso.fit(X_train_corre,y_train_corre)
dtr_model.fit(X_train_corre,y_train_corre)
clr.fit(X_train_corre,y_train_corre)
rfr = rfr.fit(X_train_corre,y_train_corre)
xgbr.fit(X_train_corre,y_train_corre)
gbr.fit(X_train_corre,y_train_corre)
# Predicting the Test set results
print('Training model Accuracy value for Linear Regressor Model: {0:0.2f}%'.format((lr_model.score(X_train_corre,y_train_corre))*100))
print('Testing Model Accuracy value for Linear Regressor Model: {0:0.2f}%'.format((lr_model.score(X_test_corre, y_test_corre))*100))
y_pred_lr_model = lr_model.predict(X_test_corre)
sns.set(style = 'darkgrid', color_codes = True)
with sns.axes_style('white'):
    sns.jointplot(x = y_test_corre, y = pd.DataFrame(y_pred_lr_model), kind = 'reg', color = 'blue')
# Predicting the Test set results
print('Training model Accuracy value for Ridge Regressor Model: {0:0.2f}%'.format((ridge.score(X_train_corre,y_train_corre))*100))
print('Testing Model Accuracy value for Ridge Regressor Model: {0:0.2f}%'.format((ridge.score(X_test_corre, y_test_corre))*100))
y_pred_ridge = ridge.predict(X_test_corre)
sns.set(style = 'darkgrid', color_codes = True)
with sns.axes_style('white'):
    sns.jointplot(x = y_test_corre, y = pd.DataFrame(y_pred_ridge), kind = 'reg', color = 'green')
# Predicting the Test set results
print('Training model Accuracy value for Lasso Regressor Model: {0:0.2f}%'.format((lasso.score(X_train_corre,y_train_corre))*100))
print('Testing Model Accuracy value for Lasso Regressor Model: {0:0.2f}%'.format((lasso.score(X_test_corre, y_test_corre))*100))
y_pred_lasso = lasso.predict(X_test_corre)
sns.set(style = 'darkgrid', color_codes = True)
with sns.axes_style('white'):
    sns.jointplot(x = y_test_corre, y = pd.DataFrame(y_pred_lasso), kind = 'reg', color = 'red')
# Predicting the Test set results
print('Training model Accuracy value for Decision Tree Regressor Model: {0:0.2f}%'.format((dtr_model.score(X_train_corre,y_train_corre))*100))
print('Testing Model Accuracy value for Decision Tree Regressor Model: {0:0.2f}%'.format((dtr_model.score(X_test_corre, y_test_corre))*100))
y_pred_dtr_model = dtr_model.predict(X_test_corre)
sns.set(style = 'darkgrid', color_codes = True)
with sns.axes_style('white'):
    sns.jointplot(x = y_test_corre, y = pd.DataFrame(y_pred_dtr_model), kind = 'reg', color = 'violet')
# Predicting the Test set results
print('Training model Accuracy value for Support vector Regressor Model: {0:0.2f}%'.format((clr.score(X_train_corre,y_train_corre))*100))
print('Testing Model Accuracy value for Support vector Regressor Model: {0:0.2f}%'.format((clr.score(X_test_corre, y_test_corre))*100))
y_pred_clr = clr.predict(X_test_corre)
sns.set(style = 'darkgrid', color_codes = True)
with sns.axes_style('white'):
    sns.jointplot(x = y_test_corre, y = pd.DataFrame(y_pred_clr), kind = 'reg', color = 'purple')
# Predicting the Test set results
print('Training model Accuracy value for Random Forest Regressor Model: {0:0.2f}%'.format((rfr.score(X_train_corre,y_train_corre))*100))
print('Testing Model Accuracy value for Random Forest Regressor Model: {0:0.2f}%'.format((rfr.score(X_test_corre, y_test_corre))*100))
y_pred_rfr = rfr.predict(X_test_corre)
sns.set(style = 'darkgrid', color_codes = True)
with sns.axes_style('white'):
    sns.jointplot(x = y_test_corre, y = pd.DataFrame(y_pred_rfr), kind = 'reg', color = 'orange')
# Predicting the Test set results
print('Training model Accuracy value for XGBoost Regressor Model: {0:0.2f}%'.format((xgbr.score(X_train_corre,y_train_corre))*100))
print('Testing Model Accuracy value for XGBoost Regressor Model: {0:0.2f}%'.format((xgbr.score(X_test_corre, y_test_corre))*100))
y_pred_xgbr = xgbr.predict(X_test_corre)
sns.set(style = 'darkgrid', color_codes = True)
with sns.axes_style('white'):
    sns.jointplot(x = y_test_corre, y = pd.DataFrame(y_pred_xgbr), kind = 'reg', color = 'brown')
# Predicting the Test set results
print('Training model Accuracy value for Gradient Boost Regressor Model: {0:0.2f}%'.format((gbr.score(X_train_corre,y_train_corre))*100))
print('Testing Model Accuracy value for Gradient Boost Regressor Model: {0:0.2f}%'.format((gbr.score(X_test_corre, y_test_corre))*100))
y_pred_gbr = gbr.predict(X_test_corre)
sns.set(style = 'darkgrid', color_codes = True)
with sns.axes_style('white'):
    sns.jointplot(x = y_test_corre, y = pd.DataFrame(y_pred_gbr), kind = 'reg', color = 'teal')
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
num_folds = 10
seed = 235
kfold = KFold(n_splits = num_folds, shuffle = True, random_state = seed)  # random_state requires shuffle=True
result_lr = cross_val_score(lr_model, X_scaled_corre, y_scaled_corre, cv = kfold)
print(result_lr)
print('\n')
print('Model Score in iteration 2 for Linear Regression:%.3f%% (%.3f%%)'%(result_lr.mean()*100.0, result_lr.std()*100.0))
print('\n')
result_ridge = cross_val_score(ridge, X_scaled_corre, y_scaled_corre, cv = kfold)
print(result_ridge)
print('\n')
print('Model Score in iteration 2 for Ridge:%.3f%% (%.3f%%)'%(result_ridge.mean()*100.0, result_ridge.std()*100.0))
print('\n')
result_lasso = cross_val_score(lasso, X_scaled_corre, y_scaled_corre, cv = kfold)
print(result_lasso)
print('\n')
print('Model Score in iteration 2 for Lasso:%.3f%% (%.3f%%)'%(result_lasso.mean()*100.0, result_lasso.std()*100.0))
print('\n')
result_DTR = cross_val_score(dtr_model, X_scaled_corre, y_scaled_corre, cv = kfold)
print(result_DTR)
print('\n')
print('Model Score in iteration 2 for Decision Tree Regressor:%.3f%% (%.3f%%)'%(result_DTR.mean()*100.0, result_DTR.std()*100.0))
print('\n')
result_clr = cross_val_score(clr, X_scaled_corre, y_scaled_corre, cv = kfold)
print(result_clr)
print('\n')
print('Model Score in iteration 2 for Support Vector Regressor:%.3f%% (%.3f%%)'%(result_clr.mean()*100.0, result_clr.std()*100.0))
print('\n')
result_rfr = cross_val_score(rfr, X_scaled_corre, y_scaled_corre, cv = kfold)
print(result_rfr)
print('\n')
print('Model Score in iteration 2 for Random Forest Regressor:%.3f%% (%.3f%%)'%(result_rfr.mean()*100.0, result_rfr.std()*100.0))
print('\n')
result_xgbr = cross_val_score(xgbr, X_scaled_corre, y_scaled_corre, cv = kfold)
print(result_xgbr)
print('\n')
print('Model Score in iteration 2 for XGBoost Regressor:%.3f%% (%.3f%%)'%(result_xgbr.mean()*100.0, result_xgbr.std()*100.0))
print('\n')
result_gbr = cross_val_score(gbr, X_scaled_corre, y_scaled_corre, cv = kfold)
print(result_gbr)
print('\n')
print('Model Score in iteration 2 for Gradient Boosting Regressor:%.3f%% (%.3f%%)'%(result_gbr.mean()*100.0, result_gbr.std()*100.0))
# RMSE for each model in iteration 2, evaluated against the corrected test target (y_test_corre)
mse_i2_1 = np.mean((y_pred_lr_model - y_test_corre)**2)
rmse_i2_1 = math.sqrt(mse_i2_1)
mse_i2_2 = np.mean((y_pred_ridge - y_test_corre)**2)
rmse_i2_2 = math.sqrt(mse_i2_2)
mse_i2_3 = np.mean((y_pred_lasso.reshape(-1,1) - y_test_corre)**2)
rmse_i2_3 = math.sqrt(mse_i2_3)
mse_i2_4 = np.mean((y_pred_dtr_model.reshape(-1,1) - y_test_corre)**2)
rmse_i2_4 = math.sqrt(mse_i2_4)
mse_i2_5 = np.mean((y_pred_clr.reshape(-1,1) - y_test_corre)**2)
rmse_i2_5 = math.sqrt(mse_i2_5)
mse_i2_6 = np.mean((y_pred_rfr.reshape(-1,1) - y_test_corre)**2)
rmse_i2_6 = math.sqrt(mse_i2_6)
mse_i2_7 = np.mean((y_pred_xgbr.reshape(-1,1) - y_test_corre)**2)
rmse_i2_7 = math.sqrt(mse_i2_7)
mse_i2_8 = np.mean((y_pred_gbr.reshape(-1,1) - y_test_corre)**2)
rmse_i2_8 = math.sqrt(mse_i2_8)
result_itteration2 = pd.DataFrame({'Algorithm' : ['Linear Regression', 'Ridge', 'Lasso', 'Decision Tree', 'Support Vector',
'Random Forest', 'XGBoost', 'GradientBoost'],
'Model_score': [lr_model.score(X_test_corre, y_test_corre)*100.0, ridge.score(X_test_corre, y_test_corre)*100.0,
lasso.score(X_test_corre, y_test_corre)*100.0, dtr_model.score(X_test_corre, y_test_corre)*100.0,
clr.score(X_test_corre, y_test_corre)*100.0, rfr.score(X_test_corre, y_test_corre)*100.0,
xgbr.score(X_test_corre, y_test_corre)*100.0, gbr.score(X_test_corre, y_test_corre)*100.0],
'Root Mean Square Error' : [rmse_i2_1, rmse_i2_2, rmse_i2_3,
rmse_i2_4, rmse_i2_5, rmse_i2_6,
rmse_i2_7, rmse_i2_8],
'Cross_val_score' : [result_lr.mean()*100.0, result_ridge.mean()*100.0, result_lasso.mean()*100.0,
result_DTR.mean()*100.0, result_clr.mean()*100.0, result_rfr.mean()*100.0,
result_xgbr.mean()*100.0, result_gbr.mean()*100.0],
'Std_Dev':[result_lr.std()*100.0, result_ridge.std()*100.0, result_lasso.std()*100.0,
result_DTR.std()*100.0, result_clr.std()*100.0, result_rfr.std()*100.0,
result_xgbr.std()*100.0, result_gbr.std()*100.0]})
result_itteration2
It can be observed that, after addressing outliers in all features by imputing each column's median, the model scores have decreased.
However, among all models, the Random Forest Regressor has the highest score with the train_test_split method,
and the XGBoost model has the highest cross-validation score, with a standard deviation of 3.316.
We will explore the scores further and try to enhance model performance by creating some composite features and again addressing the outliers present in the features, this time with different numbers of clusters.
df_attr = df.iloc[:, 0:10]
df_attr['cor_fine_agg'] = df_attr['coarseagg'] + df_attr['fineagg']
df_attr['ash_slag'] = df_attr['ash'] + df_attr['slag']
sns.scatterplot(x = df_attr.ash_slag, y = df_attr.strength, color = 'brown');
plt.title('Variation of strength with respect to both ash and slag', color = 'brown');
sns.scatterplot(x = df_attr.cor_fine_agg, y = df_attr.strength, color = 'purple');
plt.title('Variation of strength with respect to both fine and coarse aggregate', color = 'brown');
corr_matrix = df_attr.corr()
corr_matrix['strength'].sort_values(ascending = False)
df_attr = df_attr.apply(zscore)
cluster_range = range(3, 12)
cluster_errors = []
for num_clusters in cluster_range:
    clusters = KMeans(n_clusters = num_clusters, n_init = 5)
    clusters.fit(df_attr)
    labels = clusters.labels_
    centroids = clusters.cluster_centers_
    cluster_errors.append(clusters.inertia_)
clusters_df_attr = pd.DataFrame({'num_clusters':cluster_range, 'cluster_errors': cluster_errors})
clusters_df_attr[0:30]
# Elbow Plot
plt.figure(figsize = (12,6))
plt.plot(clusters_df_attr.num_clusters, clusters_df_attr.cluster_errors, marker = 'o');
cluster = KMeans(n_clusters = 6, random_state = 2535)
cluster.fit(df_attr)
prediction = cluster.predict(df_attr)
# Creating a new column 'GROUP' which will hold the cluster id of each record
df_attr['GROUP'] = prediction
# Creating a mirror copy for later re-use instead of building repeatedly
df_attr_copy = df_attr.copy(deep = True)
centroids = cluster.cluster_centers_
centroids
df_attr.boxplot(by = 'GROUP', layout = (3,4), figsize = (15,20), color = 'orange');
# Addressing the outliers at group level
def replace(group):
    # get the median and the standard deviation of every group
    median, std = group.median(), group.std()
    # flag members more than 2 standard deviations away from the group median
    outliers = (group - median).abs() > 2*std
    # impute the group median for the flagged outliers
    group[outliers] = median
    return group
df_corrected = df_attr.groupby('GROUP').transform(replace)
concat_data = df_corrected.join(pd.DataFrame(df_attr['GROUP']))
concat_data.boxplot(by = 'GROUP', layout = (3,4), figsize = (15,20), color = 'blue');
NOTE: When we replace outliers with the median or mean, the shape of the distribution changes and the standard deviation becomes tighter, which creates new outliers. These new outliers lie much closer to the centre than the original ones, so we accept them without further modification.
In other words, by forcefully replacing outliers we tightened the boundary and thereby created some new outliers.
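This effect is easy to demonstrate on a small synthetic sample (the data below are made up for illustration, not taken from the concrete dataset): imputing the median for points beyond 2 standard deviations shrinks the standard deviation, so the 2-sigma boundary moves inward and flags points that were previously acceptable.

```python
import numpy as np

rng = np.random.default_rng(0)
# 200 well-behaved points plus 3 clear outliers (synthetic illustration)
x = np.concatenate([rng.normal(0, 1, 200), [8.0, -7.5, 9.0]])

def replace_outliers(values):
    # replace points more than 2 std. dev. away from the median with the median
    median, std = np.median(values), values.std()
    cleaned = values.copy()
    cleaned[np.abs(values - median) > 2 * std] = median
    return cleaned

cleaned = replace_outliers(x)
# the distribution tightens, so the 2-sigma boundary shrinks...
print('std before: %.3f, std after: %.3f' % (x.std(), cleaned.std()))
# ...and some previously acceptable points now fall outside the new boundary
new_outliers = np.abs(cleaned - np.median(cleaned)) > 2 * cleaned.std()
print('new outliers created:', int(new_outliers.sum()))
```

Because the new outliers are genuine members of the tighter distribution, iterating the replacement would keep chasing them; accepting them, as done above, is the pragmatic stopping point.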
with sns.axes_style('white'):
plot = sns.lmplot('cement', 'strength', data = concat_data, hue = 'GROUP')
plot.set(ylim = (-3,3));
plt.title('Cement vs. Strength', color = 'brown');
with sns.axes_style('white'):
plot = sns.lmplot('ash', 'strength', data = concat_data, hue = 'GROUP')
plot.set(ylim = (-3,3));
plt.title('Ash vs. Strength', color = 'brown');
with sns.axes_style('white'):
plot = sns.lmplot('water', 'strength', data = concat_data, hue = 'GROUP')
plot.set(ylim = (-3,3));
plt.title('Water vs. Strength', color = 'brown');
with sns.axes_style('white'):
plot = sns.lmplot('slag', 'strength', data = concat_data, hue = 'GROUP')
plot.set(ylim = (-3,3));
plt.title('Slag vs. Strength', color = 'brown');
with sns.axes_style('white'):
plot = sns.lmplot('age', 'strength', data = concat_data, hue = 'GROUP')
plot.set(ylim = (-3,3));
plt.title('Age vs. Strength', color = 'brown');
with sns.axes_style('white'):
plot = sns.lmplot('fineagg', 'strength', data = concat_data, hue = 'GROUP')
plot.set(ylim = (-3,3));
plt.title('Fine Aggregate vs. Strength', color = 'brown');
with sns.axes_style('white'):
plot = sns.lmplot('coarseagg', 'strength', data = concat_data, hue = 'GROUP')
plot.set(ylim = (-3,3));
plt.title('Coarse Aggregate vs. Strength', color = 'brown');
with sns.axes_style('white'):
plot = sns.lmplot('superplastic', 'strength', data = concat_data, hue = 'GROUP')
plot.set(ylim = (-3,3));
plt.title('Super Plastic vs. Strength', color = 'brown');
with sns.axes_style('white'):
plot = sns.lmplot('cor_fine_agg', 'strength', data = concat_data, hue = 'GROUP')
plot.set(ylim = (-3,3));
plt.title('cor_fine_agg vs. Strength', color = 'brown');
with sns.axes_style('white'):
plot = sns.lmplot('ash_slag', 'strength', data = concat_data, hue = 'GROUP')
plot.set(ylim = (-3,3));
plt.title('ash_slag vs. Strength', color = 'brown');
The above plots show the relationship between the target column and each independent column on a cluster-wise (group-wise) basis.
Here, we have formed 6 clusters based on the elbow plot.
For each independent variable, the clusters are plotted with their respective best-fit lines.
The more horizontal a line is, the weaker that feature is at predicting the target column: a horizontal line effectively fixes a single value of the target, so it cannot capture the spread of the data points.
We can observe from the above plots that, in each cluster, the data are spread far from the best-fit line, signifying a large variance within the cluster.
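The point about horizontal fit lines can be checked directly. When a feature carries no information about the target, the least-squares slope is near zero, the fitted line collapses to the target mean, and R² approaches zero. A minimal sketch on synthetic data (assumed here purely for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
X = rng.normal(size=(500, 1))        # feature carrying no information about the target
y = rng.normal(loc=5.0, size=500)    # target generated independently of the feature

model = LinearRegression().fit(X, y)
# slope near zero -> an almost horizontal best-fit line
print('slope     : %.4f' % model.coef_[0])
# the line collapses to (roughly) the target mean...
print('intercept : %.4f   (target mean: %.4f)' % (model.intercept_, y.mean()))
# ...and captures almost none of the spread
print('R^2       : %.4f' % model.score(X, y))
```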
df_corrected.head()
X_scaled_corre = df_corrected.drop('strength', axis =1)
y_scaled_corre = df_corrected[['strength']]
# Split X and y into training and test set in 70:30 ratio
X_train_corre, X_test_corre, y_train_corre, y_test_corre = train_test_split(X_scaled_corre, y_scaled_corre, test_size=0.30, random_state=1)
lr_model.fit(X_train_corre, y_train_corre)
ridge.fit(X_train_corre,y_train_corre)
lasso.fit(X_train_corre,y_train_corre)
dtr_model.fit(X_train_corre,y_train_corre)
clr.fit(X_train_corre,y_train_corre)
rfr.fit(X_train_corre,y_train_corre)
xgbr.fit(X_train_corre,y_train_corre)
gbr.fit(X_train_corre,y_train_corre)
# Predicting the Test set results
print('Training model Accuracy value for Linear Regressor Model: {0:0.2f}%'.format((lr_model.score(X_train_corre,y_train_corre))*100))
print('Testing Model Accuracy value for Linear Regressor Model: {0:0.2f}%'.format((lr_model.score(X_test_corre, y_test_corre))*100))
y_pred_lr_model = lr_model.predict(X_test_corre)
sns.set(style = 'darkgrid', color_codes = True)
with sns.axes_style('white'):
sns.jointplot(x = y_test_corre, y = pd.DataFrame(y_pred_lr_model), kind = 'reg', color = 'green')
# Predicting the Test set results
print('Training model Accuracy value for Ridge Regressor Model: {0:0.2f}%'.format((ridge.score(X_train_corre,y_train_corre))*100))
print('Testing Model Accuracy value for Ridge Regressor Model: {0:0.2f}%'.format((ridge.score(X_test_corre, y_test_corre))*100))
y_pred_ridge = ridge.predict(X_test_corre)
sns.set(style = 'darkgrid', color_codes = True)
with sns.axes_style('white'):
sns.jointplot(x = y_test_corre, y = pd.DataFrame(y_pred_ridge), kind = 'reg', color = 'red')
# Predicting the Test set results
print('Training model Accuracy value for Lasso Regressor Model: {0:0.2f}%'.format((lasso.score(X_train_corre,y_train_corre))*100))
print('Testing Model Accuracy value for Lasso Regressor Model: {0:0.2f}%'.format((lasso.score(X_test_corre, y_test_corre))*100))
y_pred_lasso = lasso.predict(X_test_corre)
sns.set(style = 'darkgrid', color_codes = True)
with sns.axes_style('white'):
sns.jointplot(x = y_test_corre, y = pd.DataFrame(y_pred_lasso), kind = 'reg', color = 'purple')
# Predicting the Test set results
print('Training model Accuracy value for Decision Tree Regressor Model: {0:0.2f}%'.format((dtr_model.score(X_train_corre,y_train_corre))*100))
print('Testing Model Accuracy value for Decision Tree Regressor Model: {0:0.2f}%'.format((dtr_model.score(X_test_corre, y_test_corre))*100))
y_pred_dtr_model = dtr_model.predict(X_test_corre)
sns.set(style = 'darkgrid', color_codes = True)
with sns.axes_style('white'):
sns.jointplot(x = y_test_corre, y = pd.DataFrame(y_pred_dtr_model), kind = 'reg', color = 'blue')
# Predicting the Test set results
print('Training model Accuracy value for Support vector Regressor Model: {0:0.2f}%'.format((clr.score(X_train_corre,y_train_corre))*100))
print('Testing Model Accuracy value for Support vector Regressor Model: {0:0.2f}%'.format((clr.score(X_test_corre, y_test_corre))*100))
y_pred_clr = clr.predict(X_test_corre)
sns.set(style = 'darkgrid', color_codes = True)
with sns.axes_style('white'):
sns.jointplot(x = y_test_corre, y = pd.DataFrame(y_pred_clr), kind = 'reg', color = 'brown')
# Predicting the Test set results
print('Training model Accuracy value for Random Forest Regressor Model: {0:0.2f}%'.format((rfr.score(X_train_corre,y_train_corre))*100))
print('Testing Model Accuracy value for Random Forest Regressor Model: {0:0.2f}%'.format((rfr.score(X_test_corre, y_test_corre))*100))
y_pred_rfr = rfr.predict(X_test_corre)
sns.set(style = 'darkgrid', color_codes = True)
with sns.axes_style('white'):
sns.jointplot(x = y_test_corre, y = pd.DataFrame(y_pred_rfr), kind = 'reg', color = 'teal')
# Predicting the Test set results
print('Training model Accuracy value for XGBoost Regressor Model: {0:0.2f}%'.format((xgbr.score(X_train_corre,y_train_corre))*100))
print('Testing Model Accuracy value for XGBoost Regressor Model: {0:0.2f}%'.format((xgbr.score(X_test_corre, y_test_corre))*100))
y_pred_xgbr = xgbr.predict(X_test_corre)
sns.set(style = 'darkgrid', color_codes = True)
with sns.axes_style('white'):
sns.jointplot(x = y_test_corre, y = pd.DataFrame(y_pred_xgbr), kind = 'reg', color = 'orange')
# Predicting the Test set results
print('Training model Accuracy value for Gradient Boost Regressor Model: {0:0.2f}%'.format((gbr.score(X_train_corre,y_train_corre))*100))
print('Testing Model Accuracy value for Gradient Boost Regressor Model: {0:0.2f}%'.format((gbr.score(X_test_corre, y_test_corre))*100))
y_pred_gbr = gbr.predict(X_test_corre)
sns.set(style = 'darkgrid', color_codes = True)
with sns.axes_style('white'):
sns.jointplot(x = y_test_corre, y = pd.DataFrame(y_pred_gbr), kind = 'reg', color = 'brown')
mse_i3_1 = np.mean((y_pred_lr_model - y_test_corre)**2)
rmse_i3_1 = math.sqrt(mse_i3_1)
mse_i3_2 = np.mean((y_pred_ridge - y_test_corre)**2)
rmse_i3_2 = math.sqrt(mse_i3_2)
mse_i3_3 = np.mean((y_pred_lasso.reshape(-1, 1) - y_test_corre)**2)
rmse_i3_3 = math.sqrt(mse_i3_3)
mse_i3_4 = np.mean((y_pred_dtr_model.reshape(-1, 1) - y_test_corre)**2)
rmse_i3_4 = math.sqrt(mse_i3_4)
mse_i3_5 = np.mean((y_pred_clr.reshape(-1, 1) - y_test_corre)**2)
rmse_i3_5 = math.sqrt(mse_i3_5)
mse_i3_6 = np.mean((y_pred_rfr.reshape(-1, 1) - y_test_corre)**2)
rmse_i3_6 = math.sqrt(mse_i3_6)
mse_i3_7 = np.mean((y_pred_xgbr.reshape(-1, 1) - y_test_corre)**2)
rmse_i3_7 = math.sqrt(mse_i3_7)
mse_i3_8 = np.mean((y_pred_gbr.reshape(-1, 1) - y_test_corre)**2)
rmse_i3_8 = math.sqrt(mse_i3_8)
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
num_folds = 10
seed = 235
kfold = KFold(n_splits = num_folds, shuffle = True, random_state = seed)
result_lr = cross_val_score(lr_model, X_scaled_corre, y_scaled_corre, cv = kfold)
print(result_lr)
print('\n')
print('Model Score in iteration 3 for Linear Regression:%.3f%% (%.3f%%)'%(result_lr.mean()*100.0, result_lr.std()*100.0))
print('\n')
result_ridge = cross_val_score(ridge, X_scaled_corre, y_scaled_corre, cv = kfold)
print(result_ridge)
print('\n')
print('Model Score in iteration 3 for Ridge:%.3f%% (%.3f%%)'%(result_ridge.mean()*100.0, result_ridge.std()*100.0))
print('\n')
result_lasso = cross_val_score(lasso, X_scaled_corre, y_scaled_corre, cv = kfold)
print(result_lasso)
print('\n')
print('Model Score in iteration 3 for Lasso:%.3f%% (%.3f%%)'%(result_lasso.mean()*100.0, result_lasso.std()*100.0))
print('\n')
result_DTR = cross_val_score(dtr_model, X_scaled_corre, y_scaled_corre, cv = kfold)
print(result_DTR)
print('\n')
print('Model Score in iteration 3 for Decision Tree Regressor:%.3f%% (%.3f%%)'%(result_DTR.mean()*100.0, result_DTR.std()*100.0))
print('\n')
result_clr = cross_val_score(clr, X_scaled_corre, y_scaled_corre, cv = kfold)
print(result_clr)
print('\n')
print('Model Score in iteration 3 for Support Vector Regressor:%.3f%% (%.3f%%)'%(result_clr.mean()*100.0, result_clr.std()*100.0))
print('\n')
result_rfr = cross_val_score(rfr, X_scaled_corre, y_scaled_corre, cv = kfold)
print(result_rfr)
print('\n')
print('Model Score in iteration 3 for Random Forest Regressor:%.3f%% (%.3f%%)'%(result_rfr.mean()*100.0, result_rfr.std()*100.0))
print('\n')
result_xgbr = cross_val_score(xgbr, X_scaled_corre, y_scaled_corre, cv = kfold)
print(result_xgbr)
print('\n')
print('Model Score in iteration 3 for XGBoost Regressor:%.3f%% (%.3f%%)'%(result_xgbr.mean()*100.0, result_xgbr.std()*100.0))
print('\n')
result_gbr = cross_val_score(gbr, X_scaled_corre, y_scaled_corre, cv = kfold)
print(result_gbr)
print('\n')
print('Model Score in iteration 3 for Gradient Boosting Regressor:%.3f%% (%.3f%%)'%(result_gbr.mean()*100.0, result_gbr.std()*100.0))
result_itteration3 = pd.DataFrame({'Algorithm' : ['Linear Regression', 'Ridge', 'Lasso', 'Decision Tree', 'Support Vector',
'Random Forest', 'XGBoost', 'GradientBoost'],
'Model_score': [lr_model.score(X_test_corre, y_test_corre)*100.0, ridge.score(X_test_corre, y_test_corre)*100.0,
lasso.score(X_test_corre, y_test_corre)*100.0, dtr_model.score(X_test_corre, y_test_corre)*100.0,
clr.score(X_test_corre, y_test_corre)*100.0, rfr.score(X_test_corre, y_test_corre)*100.0,
xgbr.score(X_test_corre, y_test_corre)*100.0, gbr.score(X_test_corre, y_test_corre)*100.0],
'Root Mean Square Error' : [rmse_i3_1, rmse_i3_2, rmse_i3_3,
rmse_i3_4, rmse_i3_5, rmse_i3_6,
rmse_i3_7, rmse_i3_8],
'Cross_val_score' : [result_lr.mean()*100.0, result_ridge.mean()*100.0, result_lasso.mean()*100.0,
result_DTR.mean()*100.0, result_clr.mean()*100.0, result_rfr.mean()*100.0,
result_xgbr.mean()*100.0, result_gbr.mean()*100.0],
'Std_Dev':[result_lr.std()*100.0, result_ridge.std()*100.0, result_lasso.std()*100.0,
result_DTR.std()*100.0, result_clr.std()*100.0, result_rfr.std()*100.0,
result_xgbr.std()*100.0, result_gbr.std()*100.0]})
result_itteration3
In linear regression analysis, the cost function is the root mean squared error, and we need to minimize it: the model that gives the minimum RMSE value is the most suitable one. Here, Random Forest has the lowest RMSE value and can be chosen for further analysis. However, the model score also plays an important role in deciding on the most accurate model.
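As a sanity check on the metric itself, the RMSE used throughout these comparison tables can be computed either manually (with `np.mean` and `math.sqrt`, as in the cells above) or via scikit-learn's `mean_squared_error`; both agree. A minimal sketch with made-up predictions (the MPa values below are illustrative, not from the dataset):

```python
import math
import numpy as np
from sklearn.metrics import mean_squared_error

# hypothetical true and predicted strengths in MPa (illustrative values only)
y_true = np.array([35.2, 41.0, 28.5, 52.3])
y_pred = np.array([33.0, 44.1, 30.2, 49.8])

mse = np.mean((y_pred - y_true) ** 2)   # manual MSE, as used in the tables above
rmse_manual = math.sqrt(mse)
rmse_sklearn = math.sqrt(mean_squared_error(y_true, y_pred))
print('manual RMSE : %.4f' % rmse_manual)
print('sklearn RMSE: %.4f' % rmse_sklearn)
```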
It can be observed that, after generating two composite features and addressing outliers in all features by imputing each column's median, the model scores have not increased; rather, they have deteriorated further.
However, among all models the Random Forest Regressor has the highest score with the train_test_split method,
and it also has the highest cross-validation score, with a standard deviation of 3.886.
Based on the RMSE value, model prediction score, and CV score, across the non-regularised, regularised, and other regression models, only the Random Forest Regressor and the XGBoost Regressor gave decent results. And among all three iterations, we obtained the best score in iteration 1.
Thus we will go with the result of the 1st iteration and the process we followed for it.
We will now explore model performance further by considering only the two best models, whose mean scores are nearly equal and also the highest among all.
In this iteration, we will tune these models: we take the XGBoost and Random Forest regression models for model tuning and check their scores.
For model tuning, we are going to use Grid Search CV and Randomized Search CV.
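The practical difference between the two search strategies is the fit budget: grid search evaluates every combination in the grid, while randomized search samples a fixed number of candidates from the parameter distributions. This can be seen without fitting anything, using scikit-learn's parameter iterators (the grid below is a small hypothetical one, not the exact grids used later):

```python
from scipy.stats import randint as sp_randint
from sklearn.model_selection import ParameterGrid, ParameterSampler

# a small hypothetical grid (not the exact grids used below)
grid = {'max_depth': [10, 20, 30, 40, 50],
        'min_samples_leaf': [1, 2, 4, 8],
        'n_estimators': [100]}
# grid search fits every combination: 5 * 4 * 1 = 20 candidates
print('grid search candidates      :', len(ParameterGrid(grid)))

distributions = {'max_depth': [10, 20, 30, 40, 50],
                 'min_samples_leaf': sp_randint(1, 11),  # a distribution, not a list
                 'n_estimators': [100]}
# randomized search fits a fixed budget of n_iter candidates,
# however large the underlying search space is
sampled = list(ParameterSampler(distributions, n_iter=10, random_state=0))
print('randomized search candidates:', len(sampled))
```

This is why randomized search stays affordable when distributions like `sp_randint(1, 11)` replace short lists: the cost is controlled by `n_iter` rather than by the product of all grid sizes.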
from sklearn.model_selection import GridSearchCV
parameters_rfr = {'bootstrap':[True],
'max_depth': [10,20,30,40,50],
'max_features': ['auto', 'sqrt'],
'min_samples_leaf': [1,2,4,8],
'n_estimators': [100]}
clf = GridSearchCV(RandomForestRegressor(), parameters_rfr, cv = 5, verbose = 2, n_jobs=4)
clf.fit(X_scaled, y_scaled)
clf.best_params_
rfr_gdcv = RandomForestRegressor(bootstrap=True,
max_depth=30,
max_features= 'auto',
min_samples_leaf= 1,
n_estimators=100)
rfr_gdcv.fit(X_train_scaled, y_train_scaled)
rfr_gdcv_score = cross_val_score(rfr_gdcv, X_test_scaled, y_test_scaled, cv=5)
print('Model Score in iteration 4 for Random Forest Regressor using GridSearch CV:%.3f%% (%.3f%%)'%(rfr_gdcv_score.mean()*100.0, rfr_gdcv_score.std()*100.0))
parameters_xgbr = {'gamma':[0.1,0.2,0.3,0.4],
'max_depth':[3,None],
'min_child_weight':[0,3,4,9],
'num_parallel_tree': [0,4,6,9],
'colsample_bylevel': [0,3,4,8],
'colsample_bynode': [0,3,7,9],
'colsample_bytree': [0,2,4,6],
'n_estimators':[50,80,100]}
gscv_xgbr = GridSearchCV(xgbr, parameters_xgbr, cv = 5, verbose = 2, n_jobs=4)
gscv_xgbr.fit(X_train_scaled, y_train_scaled)
gscv_xgbr.best_params_
xgbr_gscv = xgboost.XGBRegressor(colsample_bylevel=0,
colsample_bynode= 0,
colsample_bytree= 0,
gamma= 0.1,
max_depth= 3,
min_child_weight= 0,
n_estimators= 100,
num_parallel_tree= 9)
xgbr_gscv.fit(X_train_scaled, y_train_scaled)
xgbr_gscv_score = cross_val_score(xgbr_gscv, X_test_scaled, y_test_scaled, cv=5)
print('Model Score in iteration 4 for XGBoost Regressor using GridSearch CV:%.3f%% (%.3f%%)'%(xgbr_gscv_score.mean()*100.0, xgbr_gscv_score.std()*100.0))
# distributions (rather than fixed lists) to sample hyperparameters from
from scipy.stats import randint as sp_randint
parameter_rfr_rcv = {"max_depth": [3, None],
                     "max_features": sp_randint(1, 11),
                     "min_samples_split": sp_randint(2, 11),
                     "min_samples_leaf": sp_randint(1, 11),
                     "bootstrap": [True, False],
                     }
# run randomized search
from sklearn.model_selection import RandomizedSearchCV
samples = 10 # number of random samples
rscv = RandomizedSearchCV(rfr, param_distributions =parameter_rfr_rcv, n_iter=samples)
rscv.fit(X_train_scaled, y_train_scaled)
print(rscv.best_params_)
rfr_rscv = RandomForestRegressor(bootstrap= False, max_depth= None,
max_features= 2, min_samples_leaf= 1,
min_samples_split= 4)
rfr_rscv.fit(X_train_scaled, y_train_scaled)
rfr_rscv_score = cross_val_score(rfr_rscv, X_test_scaled, y_test_scaled, cv=5)
print('Model Score in iteration 4 for Random Forest Regressor using RandomSearch CV:%.3f%% (%.3f%%)'%(rfr_rscv_score.mean()*100.0, rfr_rscv_score.std()*100.0))
parameters_xgbr_rscv = {'gamma':[0.1],
'max_depth':sp_randint(1, 11),
'min_child_weight': sp_randint(1, 11),
'num_parallel_tree': sp_randint(0, 2),
'colsample_bylevel': sp_randint(0, 2),
'colsample_bynode': sp_randint(0, 2),
'colsample_bytree': sp_randint(0, 5),
'n_estimators': [100]}
xgbr_rscv = RandomizedSearchCV(xgbr, param_distributions = parameters_xgbr_rscv, cv = 5, verbose = 2, n_jobs=4)
xgbr_rscv.fit(X_train_scaled, y_train_scaled)
xgbr_rscv.best_params_
xgbr_rscv = xgboost.XGBRegressor(colsample_bylevel= 0,
colsample_bynode= 0,
colsample_bytree= 1,
gamma= 0.1,
max_depth= 9,
min_child_weight= 7,
n_estimators=100,
num_parallel_tree= 1)
xgbr_rscv.fit(X_train_scaled, y_train_scaled)
xgbr_rscv_score = cross_val_score(xgbr_rscv, X_test_scaled, y_test_scaled, cv=5)
print('Model Score in iteration 4 for XGBoost Regressor using RandomSearch CV:%.3f%% (%.3f%%)'%(xgbr_rscv_score.mean()*100.0, xgbr_rscv_score.std()*100.0))
# configure bootstrap
from sklearn.utils import resample
n_iterations = 1000 # number of bootstrap samples to create
n_size = int(len(df_scaled)*0.75) # pick only 75% of the given data in every bootstrap sample
values = df_scaled.values
# run bootstrap
stats = list()
for i in range(n_iterations):
    # prepare train and test sets
    train = resample(values, n_samples = n_size) # sampling with replacement
    test = np.array([x for x in values if x.tolist() not in train.tolist()]) # rows not drawn into the sample
    # fit the model
    model = xgboost.XGBRegressor()
    model.fit(train[:, :-1], train[:, -1])
    # evaluate the model on the rows left out of the sample
    predictions = model.predict(test[:, :-1])
    score_test = model.score(test[:, :-1], test[:, -1])
    print(score_test)
    stats.append(score_test)
# plot scores
pyplot.hist(stats)
pyplot.show()
# confidence interval
alpha = 0.95 # for 95% confidence
p = ((1.0-alpha)/2.0)*100 # lower boundary: 2.5% tail region on each side for a 95% interval
lower = max(0.0, np.percentile(stats,p))
p = (alpha + ((1.0-alpha)/2.0))*100
upper = min(1.0, np.percentile(stats,p))
print('%0.1f confidence interval %.1f%% and %.1f%%' %(alpha*100, lower*100, upper*100))
result_itteration1
fig = plt.figure(figsize = (22,5))
plt.title ('RMSE values for various Algorithm',y=1, size = 22, color = 'red')
sns.barplot(y = result_itteration1['Root Mean Square Error'], x = result_itteration1['Algorithm'], facecolor = (0.5,0.5,0.5,0.8), linewidth = 10, edgecolor = sns.color_palette ('dark', 12) );
plt.ylabel('RMSE', size = 20)
plt.xlabel('Algorithm', size = 20)
plt.tight_layout()
result_itteration2
fig = plt.figure(figsize = (22,5))
plt.title ('RMSE values for various Algorithm',y=1, size = 22, color = 'red')
sns.barplot(y = result_itteration2['Root Mean Square Error'], x = result_itteration2['Algorithm'], facecolor = (0.5,0.5,0.5,0.8), linewidth = 10, edgecolor = sns.color_palette ('dark', 12) );
plt.ylabel('RMSE', size = 20)
plt.xlabel('Algorithm', size = 20)
plt.tight_layout()
result_itteration3
fig = plt.figure(figsize = (22,5))
plt.title ('RMSE values for various Algorithm',y=1, size = 22, color = 'red')
sns.barplot(y = result_itteration3['Root Mean Square Error'], x = result_itteration3['Algorithm'], facecolor = (0.5,0.5,0.5,0.8), linewidth = 10, edgecolor = sns.color_palette ('dark', 12) );
plt.ylabel('RMSE', size = 20)
plt.xlabel('Algorithm', size = 20)
plt.tight_layout()
From the above iterations, we have taken the models with the lowest RMSE values and the highest model scores.
For linear regression, the cost function is the RMSE, which is to be minimized.
So models like the Random Forest and XGBoost Regressors are the best, having the lowest RMSE values and the highest model scores.
Based on these criteria, we take these two models for further tuning, to squeeze extra performance out of them without making them underfit or overfit.
# plot scores
pyplot.hist(stats)
pyplot.show()
# confidence interval
alpha = 0.95 # for 95% confidence
p = ((1.0-alpha)/2.0)*100 # lower boundary: 2.5% tail region on each side for a 95% interval
lower = max(0.0, np.percentile(stats,p))
p = (alpha + ((1.0-alpha)/2.0))*100
upper = min(1.0, np.percentile(stats,p))
print('%0.1f confidence interval %.1f%% and %.1f%%' %(alpha*100, lower*100, upper*100))
Since the XGBoost Regressor has provided the best result, minimizing the cost function and maximizing the model score, we will go with it as the final model.
After tuning the model, employing techniques to squeeze extra performance out of it without making it overfit or underfit, we achieved a model score of 86.05% with a standard deviation of 6.85 at a 95% confidence level. The range estimate of the model score is therefore between 79.2% and 92.9%.
This range estimate is at the 95% confidence level: with 95% confidence we can say our model score will lie between 79.2% and 92.9% in production.
Almost the same model score (85.68%) is obtained from the XGBoost Regressor, with a standard deviation of 3.371%, when using the Randomized Search CV hyperparameter tuning technique.